jorgecarleitao / json-deserializer
License: Apache License 2.0
Hi,
first of all, this is a really impressive creation. I love the trick of delaying number parsing; darn smart :D.
That said, I would suggest enabling SIMD when comparing against simd-json; otherwise it's a bit of a pointless comparison. The best way to do that without being too CPU-dependent is to use RUSTFLAGS="-C target-feature=+avx,+avx2,+sse4.2", since those features are present on all modern CPUs.
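For reference, a run with those features enabled might look like this (the exact benchmark command is an assumption; adjust to however the repo's benchmarks are invoked):

```shell
# Enable AVX/AVX2/SSE4.2 codegen for the whole benchmark build;
# simd-json's vectorized code paths may not be compiled in otherwise.
RUSTFLAGS="-C target-feature=+avx,+avx2,+sse4.2" cargo bench
```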
Hi, thanks for this crate (currently using it as part of Polars)! It looks like Cargo.toml specifies the license as Apache-2.0, but it'd be nice to include a license file in the repo (which Cargo should automatically include in the published crate if present).
I also noticed you're the owner of the arrow-format crate; it would be nice to do the same there.
Issue jorgecarleitao/arrow2#1426 is caused by the fact that values like 2e-42 are parsed as Number::Integer(b"2", b"-42") here, causing problems downstream.
Originally reported via pola-rs/polars#8424
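A self-contained sketch of the classification rule at stake (illustrative only, not the crate's actual code): a number token can only be treated as an integer when it has no fraction part and no negative exponent, so 2e-42 must not land in Number::Integer.

```rust
/// Returns true only when the token denotes an integer value:
/// no `.` fraction part, and no `-` sign after the `e`/`E` exponent marker.
fn is_integer_token(token: &[u8]) -> bool {
    let mut has_fraction = false;
    let mut negative_exponent = false;
    for (i, &b) in token.iter().enumerate() {
        match b {
            b'.' => has_fraction = true,
            b'e' | b'E' => {
                negative_exponent = token.get(i + 1) == Some(&b'-');
            }
            _ => {}
        }
    }
    !has_fraction && !negative_exponent
}

fn main() {
    assert!(is_integer_token(b"2"));
    assert!(is_integer_token(b"2e42")); // positive exponent still scales an integer
    assert!(!is_integer_token(b"2e-42")); // negative exponent: must be a float
    assert!(!is_integer_token(b"1.1"));
}
```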
[dependencies]
json-deserializer = "0.4.4"
use json_deserializer::Value;

fn main() {
    // The JSON document contains the escape sequence `\n`;
    // the expected decoded string contains a real newline.
    let expected = "你好,polars。\n";
    let json_data = r#""你好,polars。\n""#;
    let v = json_deserializer::parse(json_data.as_bytes()).unwrap();
    if let Value::String(s) = v {
        assert_eq!(s.as_ref(), expected)
    }
}
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `"ä½\u{a0}好ï¼\u{8c}polarsã\u{80}\u{82}\n"`,
right: `"你好,polars。\n"`', src/main.rs:13:9
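The panic output looks like each UTF-8 byte was widened into its own char somewhere in the string path. A minimal sketch of that suspected failure mode (an assumption about the cause, not the crate's code):

```rust
/// Decodes string bytes one u8 at a time, which treats multi-byte UTF-8
/// sequences as separate Latin-1 characters and produces mojibake.
fn byte_by_byte(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    let bytes = "你好,polars。".as_bytes();
    let mojibake = byte_by_byte(bytes);
    // Proper UTF-8 decoding recovers the original text.
    let correct = std::str::from_utf8(bytes).unwrap();
    assert_ne!(mojibake, correct);
    println!("byte-by-byte: {mojibake}");
    println!("utf-8:        {correct}");
}
```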
I was testing potential JSON libraries to use and I noticed several issues with cases from json.org's JSON Checker; for example, these invalid documents from the checker's fail cases:
["Extra close"]]
{"Comma instead if closing brace": true,
That said, this library looks great and is quite efficient!
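The first case leaves bytes behind after a complete top-level value. A minimal sketch of the trailing-input check that catches it (an assumption about the fix, not this crate's code):

```rust
/// After the top-level value has been parsed, any remaining
/// non-whitespace byte (such as the second `]` in `["Extra close"]]`)
/// means the document is invalid JSON.
fn has_trailing_garbage(rest: &[u8]) -> bool {
    rest.iter().any(|b| !b" \t\r\n".contains(b))
}

fn main() {
    // After parsing `["Extra close"]`, one `]` byte remains:
    assert!(has_trailing_garbage(b"]"));
    // Trailing whitespace alone is fine:
    assert!(!has_trailing_garbage(b" \n"));
}
```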
This was first reported in polars. We are using arrow2 to infer the schema for ndjson data types, which looks like it is using this crate to deserialize JSON in the implementation of arrow2::io::ndjson::read::infer
serde_json parses the number correctly, as {"Value": Number(11000000000.0)}. json_deserializer errors out with MissingComa(43) when trying to parse scientific notation with a + symbol.
fn main() {
    let mut s = br#"{"Value":1.1e+10}"#.to_vec();
    let v = json_deserializer::parse(&mut s);
    println!("{:?}", v);
}
// Err(MissingComa(43))
Note: the - symbol parses as expected.
fn main() {
    let mut s = br#"{"Value":1.1e-10}"#.to_vec();
    let v = json_deserializer::parse(&mut s);
    println!("{:?}", v);
}
// Ok(Object({"Value": Number(Float([49, 46, 49], [45, 49, 48]))}))
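Per RFC 8259, the exponent part of a number is e/E followed by an optional + or - and then digits. A self-contained sketch of that grammar rule (illustrative only, not the crate's parser):

```rust
/// Returns the length of a valid exponent part at the start of `bytes`
/// (`e`/`E`, optional sign, one or more digits), or None if there isn't one.
fn exponent_len(bytes: &[u8]) -> Option<usize> {
    if !matches!(bytes.first(), Some(&b'e') | Some(&b'E')) {
        return None;
    }
    let mut i = 1;
    if matches!(bytes.get(i), Some(&b'+') | Some(&b'-')) {
        i += 1; // the `+` case is the one the reported parse misses
    }
    let digits = bytes[i..].iter().take_while(|b| b.is_ascii_digit()).count();
    if digits == 0 {
        None
    } else {
        Some(i + digits)
    }
}

fn main() {
    assert_eq!(exponent_len(b"e+10"), Some(4));
    assert_eq!(exponent_len(b"e-10"), Some(4));
    assert_eq!(exponent_len(b"e10"), Some(3));
    assert_eq!(exponent_len(b"e+"), None); // sign without digits is invalid
}
```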
Awesome work as usual @jorgecarleitao.
I was curious what the performance would be with serde_json's in-place deserialization, so I created a custom deserializer for parser::Value. I had to duplicate parser::Value to use converted numbers, since it wasn't possible to use this library's lazy numeric parser without replicating all of serde_json's parsing code.
The code for the custom deserializer is here. I updated the benchmarks to include this implementation, and also added a test for escaped strings.
Results from my macbook are below. serde_json_custom is the benchmark for the custom deserializer; string_escaped_chars is the benchmark for escaped strings.
string json_deserializer 2^10 time: [63.949 us 64.339 us 64.739 us]
string serde_json 2^10 time: [100.51 us 101.21 us 101.94 us]
string serde_json_custom 2^10 time: [41.162 us 41.491 us 41.923 us]
string simd_json 2^10 time: [25.250 us 25.574 us 25.932 us]
string_escaped_chars json_deserializer 2^10 time: [193.93 us 195.39 us 196.92 us]
string_escaped_chars serde_json 2^10 time: [130.26 us 131.22 us 132.27 us]
string_escaped_chars serde_json_custom 2^10 time: [137.42 us 138.47 us 139.55 us]
string_escaped_chars simd_json 2^10 time: [46.427 us 46.821 us 47.252 us]
I was surprised at the results, especially since nothing stuck out from inspecting the code. So I dug in further, and it seems to come down to the relatively large match expression in compute_length. Reworking it improved the numbers:
string json_deserializer 2^10 time: [39.574 us 39.719 us 39.868 us]
string serde_json 2^10 time: [103.74 us 104.84 us 105.99 us]
string serde_json_custom 2^10 time: [46.926 us 48.188 us 49.584 us]
string simd_json 2^10 time: [27.517 us 27.786 us 28.105 us]
string_escaped_chars json_deserializer 2^10 time: [144.19 us 145.30 us 146.43 us]
string_escaped_chars serde_json 2^10 time: [143.75 us 144.64 us 145.52 us]
string_escaped_chars serde_json_custom 2^10 time: [145.71 us 147.09 us 148.73 us]
string_escaped_chars simd_json 2^10 time: [47.152 us 47.652 us 48.162 us]
If you're able to replicate the results and are interested in taking the changes, I can open a PR. Changes are in this branch. Thanks!
As discussed in #6, I was trying to port the polars ndjson reader over to this crate instead of serde_json, and noticed a huge degradation in performance after the refactor.
This is on a dataset of 500_000 records containing data similar to this (more than happy to share the actual file if needed):
{
"number": 300,
"hash": "0xb3e37f7c14742bc54d08163792d38ada69c3951817b8dde6ef96776aa5c0f00c",
"parent_hash": "0x989b8bf2af0be6c18c9c95bfde81492e0b47bcc1c26d555bb7cea2d09e92c6c3",
"nonce": "0x424b554fa4a7a04f",
"sha3_uncles": "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
"logs_bloom": "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
"transactions_root": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
"state_root": "0x34e5b52497408cd2bbcb6992dee0292498a235ec7aca1b34f6cbccb396f85105",
"receipts_root": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
"miner": "0xbb7b8287f3f0a933474a79eae42cbca977791171",
"difficulty": 19753900789,
"total_difficulty": 5531721283386,
"size": 544,
"extra_data": "0x476574682f4c5649562f76312e302e302f6c696e75782f676f312e342e32",
"gas_limit": 5000,
"gas_used": 0,
"timestamp": 1438270848,
"transaction_count": 0,
"base_fee_per_gas": null
}
An unoptimized build using serde_json will load the json file in ~35s on my device. The unoptimized build using json_deserializer loads in ~55s.
The code changes are available here, just in case I did something wrong on my end in using json_deserializer:
fn parse_lines<'a>(bytes: &[u8], buffers: &mut PlIndexMap<String, Buffer<'a>>) -> Result<usize> {
    let lines: SplitLines = SplitLines::new(bytes, NEWLINE);
    for mut line in lines {
        let value = json_deserializer::parse(&mut line).map_err(|e| {
            PolarsError::ComputeError(format!("unable to parse line: {}", e).into())
        })?;
        match value {
            Value::Object(value) => {
                buffers
                    .iter_mut()
                    .for_each(|(s, inner)| match value.get(s) {
                        Some(v) => inner.add(v).expect("inner.add(v)"),
                        None => inner.add_null(),
                    });
            }
            _ => {
                buffers.iter_mut().for_each(|(_, inner)| inner.add_null());
            }
        }
    }
    let bytes_read = bytes.len();
    Ok(bytes_read)
}
The parser doesn't expose a way to get the number of bytes read back, so the SplitLines iterator does cause some overhead. I copied over the code to call the parse_value function directly, and created an optimized loop that doesn't use the custom SplitLines iterator. I saw some perf gains (~5s), but it is still very far off from the serde implementation.
The more optimized version:
// it's ndjson, so we know it should always be {obj}\n{obj}\n
fn parse_lines<'a>(
    bytes: &mut &[u8],
    buffers: &mut PlIndexMap<String, Buffer<'a>>,
) -> Result<usize> {
    use json_deserialize::Value;
    let total_bytes = bytes.len();
    let mut read = 0;
    loop {
        if read == total_bytes {
            break;
        }
        // it's a newline
        if json_deserialize::current_token(bytes)
            .map_err(|e| PolarsError::ComputeError(format!("EOF error: {}", e).into()))?
            == NEWLINE
        {
            read += 1;
            *bytes = &bytes[1..];
        }
        // it's the object
        else {
            let byte_size = bytes.len();
            let value = json_deserialize::parse_value(bytes).map_err(|e| {
                PolarsError::ComputeError(format!("unable to parse line: {}", e).into())
            })?;
            read += byte_size - bytes.len();
            match value {
                Value::Object(value) => {
                    buffers
                        .iter_mut()
                        .for_each(|(s, inner)| match value.get(s) {
                            Some(v) => inner.add(v).expect("inner.add(v)"),
                            None => inner.add_null(),
                        });
                }
                _ => {
                    buffers.iter_mut().for_each(|(_, inner)| inner.add_null());
                }
            };
        }
    }
    Ok(total_bytes)
}