
crazyxman / simdjson_php


simdjson_php bindings for the simdjson project. https://github.com/lemire/simdjson

License: Apache License 2.0

M4 0.07% C 0.48% C++ 90.61% PHP 8.77% CMake 0.04% JavaScript 0.03%
simdjson php json json-parser

simdjson_php's Introduction

simdjson_php

simdjson_php bindings for the simdjson project.


Requirement

  • PHP 7.0+ (the latest PHP version was 8.2 at the time of writing)
  • Prerequisites: g++ (version 7 or better) or clang++ (version 6 or better), and a 64-bit system with a command-line shell (e.g., Linux, macOS, FreeBSD). Programming environments such as Visual Studio and Xcode are also supported, but different steps are needed.

Installing

Linux

simdjson can be installed with the command pecl install simdjson (you will need to enable simdjson in php.ini).

Alternately, you may wish to build from source.

MacOS

pecl install simdjson is the recommended installation method (You will need to enable simdjson in php.ini)

Alternately, you may wish to build from source.

Installing on Windows

Prebuilt DLLs can be downloaded from PECL once the PHP for Windows team fixes hardware issues.

See https://wiki.php.net/internals/windows/stepbystepbuild_sdk_2#building_pecl_extensions and .appveyor.yml for how to build this, in the meantime.

Compile simdjson_php in Linux

$ phpize
$ ./configure
$ make
$ make test
$ make install

Add the following line to your php.ini

extension=simdjson.so

simdjson_php Usage

$jsonString = <<<'JSON'
{
  "Image": {
    "Width":  800,
    "Height": 600,
    "Title":  "View from 15th Floor",
    "Thumbnail": {
      "Url":    "http://www.example.com/image/481989943",
      "Height": 125,
      "Width":  100
    },
    "Animated" : false,
    "IDs": [116, 943, 234, 38793, {"p": "30"}]
  }
}
JSON;

// Check if a JSON string is valid:
$isValid = simdjson_is_valid($jsonString); //return bool
var_dump($isValid);  // true

// Parse a JSON string. Similar to the json_decode() function, but without the fourth argument
try {
    // returns array|stdClass|string|float|int|bool|null.
    $parsedJSON = simdjson_decode($jsonString, true, 512);
    var_dump($parsedJSON); // PHP array
} catch (RuntimeException $e) {
    echo "Failed to parse $jsonString: {$e->getMessage()}\n";
}

// Note: "/" is the separator. Each segment is either the "key" of an object or the "index" of an array.
// E.g. "/Image/Thumbnail/Url" is recommended starting in simdjson 4.0.0,
// but "Image/Thumbnail/Url" is accepted for now.

// get the value of a "key" in a json string
// (before simdjson 4.0.0, the recommended leading "/" had to be omitted)
$value = simdjson_key_value($jsonString, "/Image/Thumbnail/Url");
var_dump($value); // string(38) "http://www.example.com/image/481989943"

$value = simdjson_key_value($jsonString, "/Image/IDs/4", true);
var_dump($value);
/*
array(1) {
  ["p"]=>
  string(2) "30"
}
*/

// Check if the key exists. Returns bool: true if it exists, false if it does not.
// Throws for invalid JSON.
$res = simdjson_key_exists($jsonString, "/Image/IDs/1");
var_dump($res); // bool(true)

// count the values
$res = simdjson_key_count($jsonString, "/Image/IDs");
var_dump($res); // int(5)

simdjson_php API

<?php

/**
 * Takes a JSON encoded string and converts it into a PHP variable.
 * Similar to json_decode()
 *
 * @param string $json The JSON string being decoded
 * @param bool $associative When true, JSON objects will be returned as associative arrays.
 *                          When false, JSON objects will be returned as objects.
 * @param int $depth the maximum nesting depth of the structure being decoded.
 * @return array|stdClass|string|float|int|bool|null
 * @throws SimdJsonException for invalid JSON
 *                           (or $json over 4GB long, or out of range integer/float)
 * @throws SimdJsonValueError for invalid $depth
 */
function simdjson_decode(string $json, bool $associative = false, int $depth = 512) {}

/**
 * Returns true if json is valid.
 *
 * @param string $json The JSON string being decoded
 * @param int $depth the maximum nesting depth of the structure being decoded.
 * @return bool
 * @throws SimdJsonValueError for invalid $depth
 */
function simdjson_is_valid(string $json, int $depth = 512) : bool {}

/**
 * Parses $json and returns the number of keys in $json matching the JSON pointer $key
 *
 * @param string $json The JSON string being decoded
 * @param string $key The JSON pointer being requested
 * @param int $depth The maximum nesting depth of the structure being decoded.
 * @param bool $throw_if_uncountable If true, then throw SimdJsonException instead of
 *                                   returning 0 for JSON pointers
 *                                   to values that are neither objects nor arrays.
 * @return int
 * @throws SimdJsonException for invalid JSON or invalid JSON pointer
 *                           (or document over 4GB, or out of range integer/float)
 * @throws SimdJsonValueError for invalid $depth
 * @see https://www.rfc-editor.org/rfc/rfc6901.html
 */
function simdjson_key_count(string $json, string $key, int $depth = 512, bool $throw_if_uncountable = false) : int {}

/**
 * Returns true if the JSON pointer $key could be found.
 *
 * @param string $json The JSON string being decoded
 * @param string $key The JSON pointer being requested
 * @param int $depth the maximum nesting depth of the structure being decoded.
 * @return bool (false if key is not found)
 * @throws SimdJsonException for invalid JSON or invalid JSON pointer
 *                           (or document over 4GB, or out of range integer/float)
 * @throws SimdJsonValueError for invalid $depth
 * @see https://www.rfc-editor.org/rfc/rfc6901.html
 */
function simdjson_key_exists(string $json, string $key, int $depth = 512) : bool {}

/**
 * Returns the value at the json pointer $key
 *
 * @param string $json The JSON string being decoded
 * @param string $key The JSON pointer being requested
 * @param int $depth the maximum nesting depth of the structure being decoded.
 * @param bool $associative When true, JSON objects will be returned as associative arrays.
 *                          When false, JSON objects will be returned as objects.
 * @return array|stdClass|string|float|int|bool|null the value at $key
 * @throws SimdJsonException for invalid JSON or invalid JSON pointer
 *                           (or document over 4GB, or out of range integer/float)
 * @throws SimdJsonValueError for invalid $depth
 * @see https://www.rfc-editor.org/rfc/rfc6901.html
 */
function simdjson_key_value(string $json, string $key, bool $associative = false, int $depth = 512) {}

/**
 * An error thrown by simdjson when processing json.
 *
 * The error code is available as $e->getCode().
 * This can be compared against the `SIMDJSON_ERR_*` constants.
 *
 * Before simdjson 2.1.0, a regular RuntimeException with an error code of 0 was thrown.
 */
class SimdJsonException extends RuntimeException {
}

/**
 * Thrown for error conditions on fields such as $depth that are not expected to be
 * from user-provided JSON, with similar behavior to php 8.0.
 *
 * NOTE: https://www.php.net/valueerror was added in php 8.0.
 * In older php versions, this extends Error instead.
 *
 * When support for php 8.0 is dropped completely,
 * a major release of simdjson will likely switch to a standard ValueError.
 */
class SimdJsonValueError extends ValueError {
}

Edge cases

There are some differences from json_decode() due to the implementation of the underlying simdjson library. simdjson_decode() throws a SimdJsonException (a subclass of RuntimeException) if simdjson rejects the JSON.

Note that the simdjson PECL uses a fork of the simdjson C++ library to imitate PHP's handling of integers and floats in JSON.

  1. Until simdjson 2.1.0, simdjson_decode() differed in how out of range 64-bit integers and floats are handled.

See https://github.com/simdjson/simdjson/blob/master/doc/basics.md#standard-compliance

  • The specification allows implementations to set limits on the range and precision of numbers accepted. We support 64-bit floating-point numbers as well as integer values.
    • We parse integers and floating-point numbers as separate types which allows us to support all signed (two's complement) 64-bit integers, like a Java long or a C/C++ long long and all 64-bit unsigned integers. When we cannot represent exactly an integer as a signed or unsigned 64-bit value, we reject the JSON document.
    • We support the full range of 64-bit floating-point numbers (binary64). The values range from std::numeric_limits<double>::lowest() to std::numeric_limits<double>::max(), so from -1.7976e308 all the way to 1.7975e308. Extreme values (less or equal to -1e308, greater or equal to 1e308) are rejected: we refuse to parse the input document. Numbers are parsed with a perfect accuracy (ULP 0): the nearest floating-point value is chosen, rounding to even when needed. If you serialized your floating-point numbers with 17 significant digits in a standard compliant manner, the simdjson library is guaranteed to recover the same numbers, exactly.
  2. The maximum string length that can be passed to simdjson_decode() is just under 4 GiB (4294967295 bytes). json_decode() can decode longer strings.

  3. Max depth is counted slightly differently for empty vs non-empty objects/arrays. In json_decode, an array with a scalar has the same depth as an array with no elements. In simdjson_decode, an array with a scalar is one level deeper than an array with no elements. For typical use cases, this shouldn't matter. (E.g. simdjson_decode('[[]]', true, 2) will succeed, but json_decode('[[]]', true, 2) and simdjson_decode('[[1]]', true, 2) will fail.)

Benchmarks

See the benchmark folder for more benchmarks.

simdjson_php's People

Contributors

crazyxman, remicollet, sandrokeil, tysonandre, weaponman


simdjson_php's Issues

Build broke on Alpine Linux

from #5 (comment)

I cannot build the latest commit under Alpine Linux 3.10, getting these errors:

$ make
/bin/sh /root/simdjson_php/libtool --mode=compile cc  -I. -I/root/simdjson_php -DPHP_ATOM_INC -I/root/simdjson_php/include -I/root/simdjson_php/main -I/root/simdjson_php -I/usr/include/php7 -I/usr/include/php7/main -I/usr/include/php7/TSRM -I/usr/include/php7/Zend -I/usr/include/php7/ext -I/usr/include/php7/ext/date/lib  -DHAVE_CONFIG_H  -g -O2   -c /root/simdjson_php/simdjson.c -o simdjson.lo 
mkdir .libs
 cc -I. -I/root/simdjson_php -DPHP_ATOM_INC -I/root/simdjson_php/include -I/root/simdjson_php/main -I/root/simdjson_php -I/usr/include/php7 -I/usr/include/php7/main -I/usr/include/php7/TSRM -I/usr/include/php7/Zend -I/usr/include/php7/ext -I/usr/include/php7/ext/date/lib -DHAVE_CONFIG_H -g -O2 -c /root/simdjson_php/simdjson.c  -fPIC -DPIC -o .libs/simdjson.o
In file included from /root/simdjson_php/simdjson.c:23:
/root/simdjson_php/src/bindings.h:25:1: error: unknown type name 'namespace'; did you mean 'isspace'?
 namespace simdjsonphp {
 ^~~~~~~~~
 isspace
/root/simdjson_php/src/bindings.h:25:23: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
 namespace simdjsonphp {
                       ^
/root/simdjson_php/src/bindings.h:41:1: error: unknown type name 'bool'; did you mean '_Bool'?
 bool cplus_simdjson_isvalid(const char *json);
 ^~~~
 _Bool
/root/simdjson_php/simdjson.c:56:22: error: conflicting types for 'cplus_simdjson_isvalid'
 extern unsigned char cplus_simdjson_isvalid(const char *json);
                      ^~~~~~~~~~~~~~~~~~~~~~
In file included from /root/simdjson_php/simdjson.c:23:
/root/simdjson_php/src/bindings.h:41:6: note: previous declaration of 'cplus_simdjson_isvalid' was here
 bool cplus_simdjson_isvalid(const char *json);
      ^~~~~~~~~~~~~~~~~~~~~~
make: *** [Makefile:193: simdjson.lo] Error 1

Commit a0e942d from May 29th builds fine.

If you want to try to reproduce, I started from a clean alpine:3.10 Docker container and only needed to install these packages: apk add git build-base php7-dev

Ignore pathologically large depths such as `simdjson_decode('{}', true, 1000000000)`

When the depth is both larger than the string length (guaranteeing the depth limit won't be hit) and larger than a cutoff such as 100000, ignore it and allocate for a smaller depth such as

  • the existing parser depth, if that is >= the string length
  • 100000, if that's sufficient
  • strlen * 2 (or the requested depth), if that's sufficient (adding extra space to make reallocations less frequent if the application calls this again with the same depth but varying string sizes)

The fact that memory is allocated for depth is different from json_decode, so users might not expect the high memory usage when attempting to avoid depth errors
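The selection rules above could be sketched roughly as follows. This is only an illustration of the proposal, not code from the extension: the function name effective_depth, its parameters and the constant kDepthCutoff are hypothetical, and the cutoff simply reuses the "such as 100000" figure.

```cpp
#include <cstddef>

// Hypothetical cutoff, taken from the "such as 100000" suggestion above.
constexpr std::size_t kDepthCutoff = 100000;

// Returns the depth to actually allocate buffers for.
//   requested:    the $depth argument supplied by the caller
//   json_len:     length of the JSON string in bytes
//   parser_depth: depth the reusable parser is already allocated for
std::size_t effective_depth(std::size_t requested, std::size_t json_len,
                            std::size_t parser_depth) {
    // Depths at or below the cutoff, or depths the input could actually
    // reach, are honoured as-is.
    if (requested <= kDepthCutoff || requested <= json_len) {
        return requested;
    }
    // From here on the limit cannot be hit: every nesting level needs at
    // least one byte, so nesting never exceeds the byte length.
    if (parser_depth >= json_len) {
        return parser_depth;  // reuse the existing allocation
    }
    if (kDepthCutoff >= json_len) {
        return kDepthCutoff;
    }
    // Leave headroom for later calls with somewhat longer strings, but never
    // go above what the caller asked for (written to avoid overflow).
    if (json_len > requested / 2) {
        return requested;
    }
    return json_len * 2;
}
```

With these rules, a call such as simdjson_decode('{}', true, 1000000000) would allocate for at most 100000 levels (or whatever the parser already has) instead of a billion.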

Expose functions using zend_smart_str for PECLs calling simdjson_php bindings

  /** ...
   * @param realloc_if_needed Whether to reallocate and enlarge the JSON buffer to add padding.
   * @return An element pointing at the root of the document, or an error:
   *         - MEMALLOC if realloc_if_needed is true or the parser does not have enough capacity,
   *           and memory allocation fails.
   *         - CAPACITY if the parser does not have enough capacity and len > max_capacity.
   *         - other json errors if parsing fails. You should not rely on these errors to always the same for the
   *           same document: they may vary under runtime dispatch (so they may vary depending on your system and hardware).
   */
  inline simdjson_result<element> parse(const uint8_t *buf, size_t len, bool realloc_if_needed = true) & noexcept;

If the capacity is at least 64 bytes larger than the string size, it's safe to avoid the extra copy and set realloc_if_needed=false, as long as the 64 bytes are initialized memory (any value) (not sure which architectures this matters for)

Limit worst-case memory for string/depth buffers after simdjson call completes

E.g. if there's a long-running CLI application that calls simdjson_decode() once on startup, for a 100 megabyte long string, then avoid keeping around the 100 megabyte buffer that was allocated for the C simdjson parser after it is finished being used.

This would make it easier to reason about the worst-case impact of simdjson on PHP's memory usage.

If a buffer would be longer than a certain threshold (e.g. 100KB), then use a different short-lived instance (so that the same parser/buffer can be reused when calling simdjson again on small json blobs)
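A minimal sketch of that threshold idea, using plain byte buffers in place of the real simdjson parser. The class name ScratchBuffers, the callback shape and the constant kRetainThreshold are assumptions made for illustration; the real change would apply the same decision to parser instances.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical threshold, from the "e.g. 100KB" suggestion above.
constexpr std::size_t kRetainThreshold = 100 * 1024;

class ScratchBuffers {
public:
    // Calls use(data, size) with a zero-initialized buffer of at least
    // `needed` bytes. Large buffers live only for the duration of the call.
    template <class Fn>
    void with_buffer(std::size_t needed, Fn&& use) {
        if (needed > kRetainThreshold) {
            std::unique_ptr<unsigned char[]> temporary(new unsigned char[needed]());
            use(temporary.get(), needed);
            return;  // temporary is freed here
        }
        if (retained_size_ < needed) {
            retained_.reset(new unsigned char[needed]());
            retained_size_ = needed;
        }
        use(retained_.get(), retained_size_);
    }

    // Bytes kept alive between calls; never more than kRetainThreshold.
    std::size_t retained_bytes() const { return retained_size_; }

private:
    std::unique_ptr<unsigned char[]> retained_;
    std::size_t retained_size_ = 0;
};
```

The long-lived buffer stays small and warm for typical requests, while a one-off large document only costs memory while it is being parsed.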

Option for float 4 bit, 8bit, 16bit, 32bit

Can you adjust the decoder to read faster? 4-bit is probably a lot faster than the default 64-bit, so it would help if we could choose the data type, for example when computing embeddings. Smaller types also compress better and keep the file size smaller.

Avoid static variables, make this work in ZTS (thread safe) php builds

// src/bindings.cpp
// see https://github.com/simdjson/simdjson/blob/master/doc/performance.md#reusing-the-parser-for-maximum-efficiency
simdjson::dom::parser parser;

Look at php-src/ext/session/session.c and php_session.h for an example of this in ps_globals, PHP_GINIT_FUNCTION, PS() macro, etc.
GSHUTDOWN may be needed for the destructor in ZTS builds.

Avoid multiple threads using the same parser instance at the same time.

Consider making this request-local in RINIT instead of GINIT to avoid holding on to memory after it spikes for a single request (e.g. if parsing a really long json document once in httpd)

Idea: simdjson_key_info(): ?array

  • Support distinguishing between object/array/null - simdjson_key_count returns 0 for non-objects/non-arrays

E.g. return an array with fields type and count or null if the key does not exist

Consider setting up CI tests

It should be possible to use circleci or drone to set up tests. At a minimum, one should check that the install process works.

Idea: Fall back to php's json_decode internally in simdjson_decode for json length > 4 GiB

(instead of throwing SimdJsonException for the capacity error)

simdjson_decode is meant to have indistinguishable behavior from json_decode apart from error codes/classes/messages, so this should be safe

  • Use the JsonException message and a distinct SIMDJSON_ERR* error code

Libraries using simdjson_decode may be forced to add extra error handling if they expect to be used for JSON > 4 GiB by any application using them. It'd be more convenient to avoid that.

Low priority compared to other tasks due to most php installations having max upload sizes

Add `simdjson_free_memory(): int`

This may be useful if applications parse megabytes of json on startup, but don't use simdjson for large requests later on.

This would free the buffers used by the underlying C simdjson library

Cast uint64 to double instead when it exceeds INT64_MAX

  • Cast uint64 to double instead
// src/simdjson.h
  // Write unsigned if it doesn't fit in a signed integer.
  if (i > uint64_t(INT64_MAX)) {
    WRITE_UNSIGNED(i, src, writer);
  } else {
    WRITE_INTEGER(negative ? (~i+1) : i, src, writer);
  }
php > var_export(simdjson_decode('9223372036854775808'));
-9223372036854775807-1
php > var_export(simdjson_decode('9223372036854775807'));                                                                                                              
9223372036854775807
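In the session above, 9223372036854775808 wraps around to PHP_INT_MIN, whereas json_decode() returns a float for integers that do not fit in a signed 64-bit value. A sketch of the proposed conversion, kept independent of the simdjson tape writer: the struct DecodedNumber and the function from_unsigned are hypothetical names used only for this illustration.

```cpp
#include <cstdint>

// Hypothetical result type: either a 64-bit integer or a double.
struct DecodedNumber {
    bool is_double;
    std::int64_t as_int;
    double as_double;
};

// Converts a parsed unsigned magnitude the way the issue proposes:
// values above INT64_MAX become doubles instead of wrapping around.
DecodedNumber from_unsigned(std::uint64_t value) {
    if (value > static_cast<std::uint64_t>(INT64_MAX)) {
        return {true, 0, static_cast<double>(value)};
    }
    return {false, static_cast<std::int64_t>(value), 0.0};
}
```

The conversion to double can lose precision for large values, which matches what a PHP float can hold.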

Enable AppVeyor builds

.appveyor.yml is checked in but not enabled. This is building successfully and enabled on my fork - https://ci.appveyor.com/project/TysonAndre/simdjson-php/history

This would detect compilation errors on Windows

EDIT: It builds as https://ci.appveyor.com/project/TysonAndre/crazyxman-simdjson-php for the crazyxman/simdjson_php repo, but because I'm a collaborator, it won't create web hooks that trigger when PRs are updated, and I don't have access to project settings (through oauth or that)? https://help.appveyor.com/discussions/problems/5554-if-a-pull-request-is-open-before-adding-project-to-appveyor-even-new-commits-dont-trigger-builds

simdjson_key_count should properly return sizes larger than 0xFFFFFF (16777215)

If the value of size() is 0xFFFFFF then iterate over the object/array instead to count the keys

Add test cases.

// class simdjson::dom::array 
  /**
   * Get the size of the array (number of immediate children).
   * It is a saturated value with a maximum of 0xFFFFFF: if the value
   * is 0xFFFFFF then the size is 0xFFFFFF or greater.
   */
  inline size_t size() const noexcept;

// class simdjson::dom::object
  /**
   * Get the size of the object (number of keys).
   * It is a saturated value with a maximum of 0xFFFFFF: if the value
   * is 0xFFFFFF then the size is 0xFFFFFF or greater.
   */
  inline size_t size() const noexcept;
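The fallback described in this issue could look roughly like the sketch below. It is written as a template over any type with size(), begin() and end(), on the assumption that simdjson::dom::array and simdjson::dom::object can be iterated that way. SaturatingList is a hypothetical stand-in that imitates the saturated size() so the idea can be tried without the library.

```cpp
#include <cstddef>
#include <vector>

// Value at which size() saturates, per the simdjson comments quoted above.
constexpr std::size_t kSaturatedSize = 0xFFFFFF;

// Returns the exact element count, iterating only when size() is saturated.
template <class Container>
std::size_t exact_count(const Container& container) {
    const std::size_t reported = container.size();
    if (reported < kSaturatedSize) {
        return reported;  // fast path: the reported size is exact
    }
    std::size_t counted = 0;
    for (auto it = container.begin(); it != container.end(); ++it) {
        ++counted;
    }
    return counted;
}

// Hypothetical stand-in for a simdjson array: size() saturates at 0xFFFFFF.
struct SaturatingList {
    std::vector<char> items;
    std::size_t size() const {
        return items.size() < kSaturatedSize ? items.size() : kSaturatedSize;
    }
    std::vector<char>::const_iterator begin() const { return items.begin(); }
    std::vector<char>::const_iterator end() const { return items.end(); }
};
```

Only containers with at least 0xFFFFFF elements pay for the linear scan, so typical documents keep the constant-time path.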

Consider upgrading to simdjson 0.4.0

Version 0.4 of simdjson is now available

Highlights

  • Test coverage has been greatly improved and we have resolved many static-analysis warnings on different systems.

New features:

  • We added a fast (8GB/s) minifier that works directly on JSON strings.
  • We added fast (10GB/s) UTF-8 validator that works directly on strings (any strings, including non-JSON).
  • The array and object elements have a constant-time size() method.

Performance:

  • Performance improvements to the API (type(), get<>()).
  • The parse_many function (ndjson) has been entirely reworked. It now uses a single secondary thread instead of several new threads.
  • We have introduced a faster UTF-8 validation algorithm (lookup3) for all kernels (ARM, x64 SSE, x64 AVX).

System support:

  • C++11 support for older compilers and systems.
  • FreeBSD support (and tests).
  • We support the clang front-end compiler (clangcl) under Visual Studio.
  • It is now possible to target ARM platforms under Visual Studio.
  • The simdjson library will never abort or print to standard output/error.

Version 0.3 of simdjson is now available

Highlights

  • Multi-Document Parsing: Read a bundle of JSON documents (ndjson) 2-4x faster than doing it individually. API docs / Design Details
  • Simplified API: The API has been completely revamped for ease of use, including a new JSON navigation API and fluent support for error code and exception styles of error handling with a single API. Docs
  • Exact Float Parsing: Now simdjson parses floats flawlessly without any performance loss (simdjson/simdjson#558).
    Blog Post
  • Even Faster: The fastest parser got faster! With a shiny new UTF-8 validator
    and meticulously refactored SIMD core, simdjson 0.3 is 15% faster than before, running at 2.5 GB/s (where 0.2 ran at 2.2 GB/s).

Minor Highlights

  • Fallback implementation: simdjson now has a non-SIMD fallback implementation, and can run even on very old 64-bit machines.
  • Automatic allocation: as part of API simplification, the parser no longer has to be preallocated-it will adjust automatically when it encounters larger files.
  • Runtime selection API: We've exposed simdjson's runtime CPU detection and implementation selection as an API, so you can tell what implementation we detected and test with other implementations.
  • Error handling your way: Whether you use exceptions or check error codes, simdjson lets you handle errors in your style. APIs that can fail return simdjson_result, letting you check the error code before using the result. But if you are more comfortable with exceptions, skip the error code and cast straight to T, and exceptions will be thrown automatically if an error happens. Use the same API either way!
  • Error chaining: We also worked to keep non-exception error-handling short and sweet. Instead of having to check the error code after every single operation, now you can chain JSON navigation calls like looking up an object field or array element, or casting to a string, so that you only have to check the error code once at the very end.

Reject strings 4GB or longer before calling simdjson

PHP 7 allows strings longer than 4GB. simdjson doesn't.

src/simdjson.h
309:constexpr size_t SIMDJSON_MAXSIZE_BYTES = 0xFFFFFFFF;

Alternately, upgrade to a newer simdjson release that includes that check on inputs.

really_inline error_code set_capacity(internal::dom_parser_implementation &parser, size_t capacity) {
  size_t max_structures = ROUNDUP_N(capacity, 64) + 2 + 7;
  parser.structural_indexes.reset( new (std::nothrow) uint32_t[max_structures] );
  if (!parser.structural_indexes) { return MEMALLOC; }
  parser.structural_indexes[0] = 0;
  parser.n_structural_indexes = 0;
  return SUCCESS;
}

Expected: { CAPACITY, "This parser can't support a document that big" },
Observed: { MEMALLOC, "Error allocating memory, we're most likely out of memory" },

$data = str_repeat(' ', 5_000_000_000);
$data .= '0';
echo simdjson_decode($data);

Probably related to set_capacity's implementation.

Low priority since it's properly handled.
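A sketch of the up-front length check from the title of this issue, which would report the expected CAPACITY error before any allocation is attempted. The names LengthCheck, kMaxDocumentBytes and check_document_length are hypothetical; the limit mirrors SIMDJSON_MAXSIZE_BYTES quoted above.

```cpp
#include <cstdint>

// Hypothetical result codes standing in for simdjson's SUCCESS / CAPACITY.
enum class LengthCheck { Ok, Capacity };

// Same value as SIMDJSON_MAXSIZE_BYTES in the snippet above.
constexpr std::uint64_t kMaxDocumentBytes = 0xFFFFFFFFull;

// Rejects documents that simdjson cannot index, before calling into it.
LengthCheck check_document_length(std::uint64_t length) {
    return length > kMaxDocumentBytes ? LengthCheck::Capacity : LengthCheck::Ok;
}
```

For the 5000000001-byte string built in the reproduction above, this check would return the capacity error without asking the allocator for several gigabytes first.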

Set up test scripts to compare simdjson_decode and json_decode handling of json edge cases

See https://github.com/TkTech/pysimdjson/tree/master/jsonexamples/test_parsing - y_* files are valid json, n_* files are invalid json, and i_*.json are implementation defined (I think)

Ideally, if simdjson_decode returns a valid result instead of throwing, then json_decode should return the exact same result (e.g. for parsing floats)

  • simdjson_decode will throw in some cases where json_decode won't. See the README.

Make error handling stricter

  • Throw RuntimeException in simdjson_key_exists/simdjson_key_count for invalid json
  • Throw SimdJsonValueError instead of returning null and emitting a notice for invalid int $depth values

Mark this as a major release for anything relying on the old behavior

simdjson_is_valid fails silently when json has depth exceeding 1024

With display_errors=stderr and error_reporting=E_ALL, there is still no reported error.
simdjson should probably check the error result for errors specifically caused by recursion depth and emit a notice

php > var_export(simdjson_is_valid(str_repeat('[', 1000) . str_repeat(']', 1000)));
true
php > var_export(simdjson_is_valid(str_repeat('[', 1050) . str_repeat(']', 1050)));
false
  1. This should emit a warning or an error instead?
  2. The depth limit should be configurable
  3. Note that struct open_container in the simdjson implementation means that 8+1 bytes will be used for a given depth limit, so the alternative of setting depth to json_length / 2 may (1) overflow size_t on 32-bit implementation, and (2) run out of memory
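The overflow concern in point 3 can be made concrete with a small check. This sketch assumes the "8+1 bytes" figure applies per depth level; the names kBytesPerDepthLevel and depth_buffers_fit are hypothetical, and size_max is passed in explicitly so the 32-bit case can be examined on any machine.

```cpp
#include <cstdint>

// Assumed cost per depth level, from the "8+1 bytes" note above.
constexpr std::uint64_t kBytesPerDepthLevel = 8 + 1;

// True if buffers for `depth` levels can be sized without exceeding
// size_max (the largest value of size_t on the target platform).
bool depth_buffers_fit(std::uint64_t depth, std::uint64_t size_max) {
    // Divide instead of multiplying so the check itself cannot overflow.
    return depth <= size_max / kBytesPerDepthLevel;
}
```

With a 32-bit size_t, a depth of json_length / 2 for a document close to 4 GB fails this check, which is the overflow case mentioned above.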

Investigation notes for flag analogous to JSON_BIGINT_AS_STRING

This may or may not be difficult with the dom parser.

Future simdjson C releases may make this easier but this functionality might not be planned for arbitrary-precision integers (only floats) - https://github.com/simdjson/simdjson/pull/1886 is in review pending manual tests and performance testing.

  • See my PR comment for a note on how BIGINT might be possible in followup

https://github.com/simdjson/simdjson/issues/167 suggests it's possible with the ondemand parser, but that may be slightly slower than the dom parser?


https://github.com/simdjson/simdjson/issues/425#issuecomment-883605600 mentions this is supported in the ondemand api

We may want to take the tiny performance hit just to imitate php's behavior for numbers in json_decode()

Note that this is effectively supported now with the On Demand API and documented as such:

simdjson::ondemand::parser parser;
simdjson::padded_string docdata =  R"({"value":12321323213213213213213213213211223})"_padded;
simdjson::ondemand::document doc = parser.iterate(docdata);
simdjson::ondemand::object obj = doc.get_object();
std::string_view token = obj["value"].raw_json_token();

Consider updating to simdjson 0.2.0

The library simdjson has a new major release (0.2.0). Major changes:

  • Support for 64-bit ARM processors, can run under iOS (iPhone).
  • Runtime dispatching on x64 processors (switches to SSE on older x64 processors, uses AVX2 when available). Supports processors as far back as the Intel Westmere.
  • More accurate number parsing.
  • Fixes most warnings under Visual Studio.
  • Several small bugs have been fixed.
  • Better performance in some cases.
  • Introduces a JSON Pointer interface https://tools.ietf.org/html/rfc6901
  • Better and more specific error messages (with optional textual descriptions).
  • valgrind clean.
  • Unified code style (LLVM).

`parser.allocate` will reallocate buffers - call allocate only to change depth

https://github.com/simdjson/simdjson/blob/master/doc/dom.md#reusing-the-parser-for-maximum-efficiency

If you're using simdjson to parse multiple documents, or in a loop, you should make a parser once and reuse it. The simdjson library will allocate and retain internal buffers between parses, keeping buffers hot in cache and keeping memory allocation and initialization to a minimum. In this manner, you can parse terabytes of JSON data without doing any new allocation.

class simdjson::dom::parser only provides set_max_depth(), allocate(), but not set_capacity(). So to set just the max depth, only call allocate() if the depth actually changed, which should be infrequent

  • parser::parse_into_document calls ensure_capacity already, and ensure_capacity calls allocate if needed

Related to #73

Note that simdjson will not need capacities beyond the range of a uint32, and will reject requests for larger capacities

/** The maximum document size supported by simdjson. */
constexpr size_t SIMDJSON_MAXSIZE_BYTES = 0xFFFFFFFF;
simdjson_warn_unused simdjson_inline error_code parser::allocate(size_t new_capacity, size_t new_max_depth) noexcept {
  if (new_capacity > max_capacity()) { return CAPACITY; }
  if (string_buf && new_capacity == capacity() && new_max_depth == max_depth()) { return SUCCESS; }

  // string_capacity copied from document::allocate
  _capacity = 0;
  size_t string_capacity = SIMDJSON_ROUNDUP_N(5 * new_capacity / 3 + SIMDJSON_PADDING, 64);
  string_buf.reset(new (std::nothrow) uint8_t[string_capacity]);
#if SIMDJSON_DEVELOPMENT_CHECKS
  start_positions.reset(new (std::nothrow) token_position[new_max_depth]);
#endif
  if (implementation) {
    SIMDJSON_TRY( implementation->set_capacity(new_capacity) );
    SIMDJSON_TRY( implementation->set_max_depth(new_max_depth) );
  } else {
    SIMDJSON_TRY( simdjson::get_active_implementation()->create_dom_parser_implementation(new_capacity, new_max_depth, implementation) );
  }
  _capacity = new_capacity;
  _max_depth = new_max_depth;
  return SUCCESS;
}
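The "only call allocate() if the depth actually changed" rule could be sketched as below. CountingParser is a hypothetical stand-in that counts reallocations; it borrows the method names capacity(), max_depth() and allocate() from simdjson::dom::parser, but returns a plain bool instead of simdjson's error_code.

```cpp
#include <cstddef>

// Hypothetical stand-in for the parser that records how often it reallocates.
struct CountingParser {
    std::size_t capacity_ = 0;
    std::size_t max_depth_ = 1024;
    int allocate_calls = 0;

    std::size_t capacity() const { return capacity_; }
    std::size_t max_depth() const { return max_depth_; }

    // Mimics allocate(): always reallocates, then records the new limits.
    bool allocate(std::size_t new_capacity, std::size_t new_max_depth) {
        ++allocate_calls;
        capacity_ = new_capacity;
        max_depth_ = new_max_depth;
        return true;
    }
};

// Changes the depth limit while keeping the current capacity, and skips the
// allocate() call entirely when the depth is already the wanted one.
bool set_depth_if_changed(CountingParser& parser, std::size_t wanted_depth) {
    if (parser.max_depth() == wanted_depth) {
        return true;  // nothing to do, buffers stay hot
    }
    return parser.allocate(parser.capacity(), wanted_depth);
}
```

Because most callers pass the same $depth on every call, the common path never touches the allocator.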

Prepare for setting this up as a PECL, check in package.xml?

I see build-packagexml.php but no package.xml.

  • Make reporting bugs easier by making the release easier to identify
  • Track the changelog in package.xml
  • Indicate min/max supported php versions
  • One of the comments on #8 was that it couldn't be tried out without a PECL

Throw SimdJsonException extends RuntimeException

Similar to how json_decode throws https://www.php.net/jsonexception with JSON_THROW_ON_ERROR

JsonException can't be used because the error type is different

code contains the error type, for possible values see json_last_error().

In addition to that, the json module is optional before php 8.0

Set codes and expose them as constants (and translate values if the simdjson implementation later changes)


This would help in writing simpler error handling in user try-catch blocks that call simdjson_decode then process the data

Continue to throw a subclass of RuntimeException for compatibility with older releases, and because this happens at runtime based on the value

Add composer library /example snippets to make it easier for projects to automatically choose simdjson if supported

// namespace goes here

/**
 * JSON decoder
 */
final class Decoder
{
    private static $simdjsonEnabled = false;

    public static function setAllowSimdjson(bool $allow): void
    {
        self::$simdjsonEnabled = $allow && self::hasWorkingSimdjson();
    }

    public static function hasWorkingSimdjson(): bool
    {
        return version_compare(phpversion('simdjson') ?: '0', '2.1.0', '>=') &&
             class_exists('SimdJsonException');
    }

    /**
     * JSON decodes the given json string $msg
     *
     * @throws JsonException
     */
    public static function jsonDecode(string $msg, bool $associative = false)
    {
        if (self::$simdjsonEnabled) {
            try {
                return simdjson_decode($msg, $associative);
            } catch (\Exception $e) {
                // Assume errors are rare and fall through, to return the exact same Error messages
            }
        }
        $decoded = \json_decode($msg, $associative);
        if (\json_last_error() !== \JSON_ERROR_NONE) {
            throw new \JsonException(\json_last_error_msg(), \json_last_error());  // requires symfony/polyfill-php73 before php 7.3
        }
        return $decoded;
    }
}

Decoder::setAllowSimdjson(true);
use ...\Decoder;
var_export(Decoder::jsonDecode('{"a":"b"}', true));

https://github.com/crazyxman/simdjson_php/tree/master/benchmark - for applications that work with long json strings, the extra method call would be worth it

[Suggest] Rename

Rename to simdjson and add topics
Change description
php bindings for the simdjson project.

Patch C simdjson library to use emalloc_safe/efree for depth/byte buffers

  • This would allow performance monitoring tools to properly indicate how much memory simdjson uses during a request
  • This would respect memory_limit settings in a conventional way and properly cause a php fatal error for invalid requests, e.g. an application receiving too much json

https://github.com/simdjson/simdjson/issues/1017 does not allow customizing the allocator yet

Idea: Throw SimdJsonException for a far lower user-provided max depth, in a subsequent major release

C programs usually have a small stack size by default, and the default C stack size of fibers (https://www.php.net/fiber) is also low for the few use cases of fibers

Applications that override $depth in simdjson_decode to a much larger value may have a stack overflow and crash when they actually attempt to parse json (e.g. user-provided) of that depth by calling simdjson_decode.

  • (PHP's json_decode currently uses a parser based on bison, so the emitted code to convert json to php values doesn't actually use the C stack recursively, and doesn't have this problem)

The current default depth of 512 is fine (2097152 bytes of data), but the current max depth is only chosen to avoid running out of C memory when allocating buffers or allocating more than needed. It should be much lower

Zend/zend_fibers.h
28:#define ZEND_FIBER_DEFAULT_C_STACK_SIZE (4096 * (((sizeof(void *)) < 8) ? 256 : 512))
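To make the macro above concrete, the sketch below evaluates it for 32-bit and 64-bit pointers and derives a depth bound from a stack budget. The function names are hypothetical, and the bytes-per-level figure used in the example is an assumption; the real stack cost per nesting level would have to be measured.

```cpp
#include <cstddef>

// Same arithmetic as ZEND_FIBER_DEFAULT_C_STACK_SIZE, with the pointer size
// passed in so both platforms can be evaluated.
constexpr std::size_t fiber_default_stack_size(std::size_t pointer_size) {
    return 4096 * (pointer_size < 8 ? 256 : 512);
}

// Upper bound on recursion depth for a given stack budget, assuming each
// nesting level consumes bytes_per_level bytes of C stack.
constexpr std::size_t max_depth_for_stack(std::size_t stack_bytes,
                                          std::size_t bytes_per_level) {
    return bytes_per_level == 0 ? 0 : stack_bytes / bytes_per_level;
}

static_assert(fiber_default_stack_size(8) == 2097152, "64-bit default is 2 MiB");
static_assert(fiber_default_stack_size(4) == 1048576, "32-bit default is 1 MiB");
```

For example, if each level cost 256 bytes of stack (an assumed figure), a 64-bit fiber stack would allow at most 8192 levels, far below the depths that can currently be requested.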
