crazyxman / simdjson_php
simdjson_php bindings for the simdjson project. https://github.com/lemire/simdjson
License: Apache License 2.0
We can set the minimum PHP version to 7.3 for this extension. See PHP Supported Versions
If you're using simdjson to parse multiple documents, or in a loop, you should make a parser once and reuse it. The simdjson library will allocate and retain internal buffers between parses, keeping buffers hot in cache and keeping memory allocation and initialization to a minimum. In this manner, you can parse terabytes of JSON data without doing any new allocation.
class simdjson::dom::parser only provides set_max_depth() and allocate(), but not set_capacity(). So to set just the max depth, only call allocate() if the depth actually changed, which should be infrequent.
parser::parse_into_document already calls ensure_capacity, and ensure_capacity calls allocate if needed.
Related to #73
Note that simdjson will not need capacities beyond the range of a uint32, and will reject requests for larger capacities
/** The maximum document size supported by simdjson. */
constexpr size_t SIMDJSON_MAXSIZE_BYTES = 0xFFFFFFFF;
simdjson_warn_unused simdjson_inline error_code parser::allocate(size_t new_capacity, size_t new_max_depth) noexcept {
  if (new_capacity > max_capacity()) { return CAPACITY; }
  if (string_buf && new_capacity == capacity() && new_max_depth == max_depth()) { return SUCCESS; }
  // string_capacity copied from document::allocate
  _capacity = 0;
  size_t string_capacity = SIMDJSON_ROUNDUP_N(5 * new_capacity / 3 + SIMDJSON_PADDING, 64);
  string_buf.reset(new (std::nothrow) uint8_t[string_capacity]);
#if SIMDJSON_DEVELOPMENT_CHECKS
  start_positions.reset(new (std::nothrow) token_position[new_max_depth]);
#endif
  if (implementation) {
    SIMDJSON_TRY( implementation->set_capacity(new_capacity) );
    SIMDJSON_TRY( implementation->set_max_depth(new_max_depth) );
  } else {
    SIMDJSON_TRY( simdjson::get_active_implementation()->create_dom_parser_implementation(new_capacity, new_max_depth, implementation) );
  }
  _capacity = new_capacity;
  _max_depth = new_max_depth;
  return SUCCESS;
}
C programs usually have a small stack size by default, and the stack size for fibers (https://www.php.net/fiber) is also low, given the few use cases of fibers.
Applications that override $depth in simdjson_decode to a much larger value may have a stack overflow and crash when they actually attempt to parse json (e.g. user-provided) of that depth by calling simdjson_decode.
The current default depth of 512 is fine (2097152 bytes of data), but the current max depth is only chosen to avoid running out of C memory when allocating buffers or allocating more than needed. It should be much lower
// Zend/zend_fibers.h
#define ZEND_FIBER_DEFAULT_C_STACK_SIZE (4096 * (((sizeof(void *)) < 8) ? 256 : 512))
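To make the numbers concrete, the macro above can be evaluated for both pointer sizes. This is only a sketch: `fiber_default_stack_size` is a hypothetical helper that mirrors the quoted formula, not something defined in php-src.

```cpp
#include <cstddef>

// Mirrors the ZEND_FIBER_DEFAULT_C_STACK_SIZE formula quoted above, with the
// pointer size passed in so that both the 32-bit and the 64-bit case can be
// evaluated on any machine. Hypothetical helper, not part of php-src.
constexpr std::size_t fiber_default_stack_size(std::size_t pointer_size) {
    return 4096 * (pointer_size < 8 ? 256 : 512);
}

// 64-bit builds: 4096 * 512 bytes (2 MiB).
static_assert(fiber_default_stack_size(8) == 2097152, "64-bit default");
// 32-bit builds: 4096 * 256 bytes (1 MiB).
static_assert(fiber_default_stack_size(4) == 1048576, "32-bit default");
```

The 64-bit result matches the 2097152 bytes mentioned above.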
This may be useful if applications parse megabytes of json on startup, but don't use simdjson for large requests later on.
This would free the buffers used by the underlying C simdjson library
Rename to simdjson and add topics
Change description
php bindings for the simdjson project.
simdjson_key_exists/simdjson_key_count for invalid json
int $depth values
Mark this as a major release for anything relying on the old behavior
PHP 7 allows strings longer than 4GB. simdjson doesn't.
// src/simdjson.h
constexpr size_t SIMDJSON_MAXSIZE_BYTES = 0xFFFFFFFF;
Alternately, upgrade to a newer simdjson release that includes that check on inputs.
really_inline error_code set_capacity(internal::dom_parser_implementation &parser, size_t capacity) {
  size_t max_structures = ROUNDUP_N(capacity, 64) + 2 + 7;
  parser.structural_indexes.reset( new (std::nothrow) uint32_t[max_structures] );
  if (!parser.structural_indexes) { return MEMALLOC; }
  parser.structural_indexes[0] = 0;
  parser.n_structural_indexes = 0;
  return SUCCESS;
}
Expected: { CAPACITY, "This parser can't support a document that big" },
Observed: { MEMALLOC, "Error allocating memory, we're most likely out of memory" },
$data = str_repeat(' ', 5_000_000_000);
$data .= '0';
echo simdjson_decode($data);
Probably related to set_capacity's implementation.
Low priority since it's properly handled.
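One way to get the expected CAPACITY error is to compare the input length against the maximum before attempting any allocation. The following is a minimal sketch under that assumption; `check_document_length` and the local `error_code` enum are hypothetical stand-ins, not simdjson's definitions.

```cpp
#include <cstdint>

// Stand-in for simdjson's error_code; only the names used in the report above.
enum class error_code { SUCCESS, CAPACITY, MEMALLOC };

// Same value as SIMDJSON_MAXSIZE_BYTES quoted above. A 64-bit type is used so
// the comparison is also meaningful for lengths above 4 GiB.
constexpr std::uint64_t MAXSIZE_BYTES = 0xFFFFFFFF;

// Hypothetical pre-check: reject oversized input before any buffer is
// allocated, so the caller sees CAPACITY rather than MEMALLOC.
constexpr error_code check_document_length(std::uint64_t len) {
    return len > MAXSIZE_BYTES ? error_code::CAPACITY : error_code::SUCCESS;
}
```

With this check in front of the allocation, the 5,000,000,001-byte string from the reproduction would be rejected with CAPACITY without touching the allocator.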
The library simdjson has a new major release (0.2.0). Major changes:
// src/bindings.cpp
// see https://github.com/simdjson/simdjson/blob/master/doc/performance.md#reusing-the-parser-for-maximum-efficiency
simdjson::dom::parser parser;
Look at php-src/ext/session/session.c and php_session.h for an example of this in ps_globals, PHP_GINIT_FUNCTION, PS() macro, etc.
GSHUTDOWN may be needed for the destructor in ZTS builds.
Avoid multiple threads using the same parser instance at the same time.
Consider making this request-local in RINIT instead of GINIT to avoid holding on to memory after it spikes for a single request (e.g. if parsing a really long json document once in httpd)
.appveyor.yml is checked in but not enabled. This is building successfully and enabled on my fork - https://ci.appveyor.com/project/TysonAndre/simdjson-php/history
This would detect compilation errors on Windows
EDIT: It builds as https://ci.appveyor.com/project/TysonAndre/crazyxman-simdjson-php for the crazyxman/simdjson_php repo, but because I'm a collaborator, it won't create web hooks that trigger when PRs are updated, and I don't have access to project settings (through oauth or that)? https://help.appveyor.com/discussions/problems/5554-if-a-pull-request-is-open-before-adding-project-to-appveyor-even-new-commits-dont-trigger-builds
If the value of size() is 0xFFFFFF then iterate over the object/array instead to count the keys
Add test cases.
// class simdjson::dom::array
/**
* Get the size of the array (number of immediate children).
* It is a saturated value with a maximum of 0xFFFFFF: if the value
* is 0xFFFFFF then the size is 0xFFFFFF or greater.
*/
inline size_t size() const noexcept;
// class simdjson::dom::object
/**
* Get the size of the object (number of keys).
* It is a saturated value with a maximum of 0xFFFFFF: if the value
* is 0xFFFFFF then the size is 0xFFFFFF or greater.
*/
inline size_t size() const noexcept;
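The fallback described above (trust size() unless it reports the saturated value, then iterate) could look roughly like this. `MockContainer` and `exact_count` are hypothetical names used only for illustration; the mock takes its saturation value as a field so the behaviour can be checked with small numbers.

```cpp
#include <cstddef>

// Saturation value documented for dom::array::size() and dom::object::size().
constexpr std::size_t SATURATED_SIZE = 0xFFFFFF;

// Hypothetical stand-in for a dom array/object whose size() saturates.
struct MockContainer {
    std::size_t actual;  // real number of immediate children
    std::size_t cap;     // value at which size() saturates

    struct iterator {
        std::size_t pos;
        constexpr iterator &operator++() { ++pos; return *this; }
        constexpr bool operator!=(const iterator &o) const { return pos != o.pos; }
    };
    constexpr std::size_t size() const { return actual < cap ? actual : cap; }
    constexpr iterator begin() const { return {0}; }
    constexpr iterator end() const { return {actual}; }
};

// Returns size() when it is exact; otherwise counts by iterating.
template <typename Container>
constexpr std::size_t exact_count(const Container &c,
                                  std::size_t saturated = SATURATED_SIZE) {
    const std::size_t reported = c.size();
    if (reported < saturated) { return reported; }
    std::size_t n = 0;
    for (auto it = c.begin(); it != c.end(); ++it) { ++n; }
    return n;
}
```

The slow path only runs for containers with at least 0xFFFFFF children, so the common case stays a single size() call.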
This added line breaks the compilation from source (Ubuntu 22.04.5 / GCC 11.2.0):
php_simdjson.cpp:250:61: error: ‘active_implementation’ is not a member of ‘simdjson’; did you mean ‘get_active_implementation’?
250 | php_info_print_table_row(2, "Implementation", simdjson::active_implementation->description().c_str());
Originally posted by @vassil-velichkov in #44 (comment)
See 08ac373 for more details.
@remicollet Any idea how to fix it?
This may or may not be difficult with the dom parser.
Future simdjson C releases may make this easier, but this functionality might not be planned for arbitrary-precision integers (only floats). https://github.com/simdjson/simdjson/pull/1886 is in review pending manual tests and performance testing.
https://github.com/simdjson/simdjson/issues/167 suggests it's possible with the ondemand parser, but that may be slightly slower than the dom parser?
https://github.com/simdjson/simdjson/issues/425#issuecomment-883605600 mentions this is supported in the ondemand API.
We may want to take the tiny performance hit just to imitate php's behavior for numbers in json_decode()
Note that this is effectively supported now with the On Demand API and documented as such:
simdjson::ondemand::parser parser;
simdjson::padded_string docdata = R"({"value":12321323213213213213213213213211223})"_padded;
simdjson::ondemand::document doc = parser.iterate(docdata);
simdjson::ondemand::object obj = doc.get_object();
std::string_view token = obj["value"].raw_json_token();
// src/simdjson.h
// Write unsigned if it doesn't fit in a signed integer.
if (i > uint64_t(INT64_MAX)) {
  WRITE_UNSIGNED(i, src, writer);
} else {
  WRITE_INTEGER(negative ? (~i+1) : i, src, writer);
}
php > var_export(simdjson_decode('9223372036854775808'));
-9223372036854775807-1
php > var_export(simdjson_decode('9223372036854775807'));
9223372036854775807
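The first result above is what you get when an unsigned 64-bit value is reinterpreted as signed. By default, json_decode() returns a float for integers that do not fit in a PHP int. A minimal sketch of that conversion follows, assuming the parser hands back the value as a uint64_t; `php_number` and `from_unsigned` are hypothetical names, not part of the extension.

```cpp
#include <cstdint>

// Hypothetical result type: either a PHP int or a PHP float.
struct php_number {
    bool is_float;
    std::int64_t int_value;
    double float_value;
};

// Values above INT64_MAX become floats instead of wrapping to a negative int.
constexpr php_number from_unsigned(std::uint64_t u) {
    if (u > static_cast<std::uint64_t>(INT64_MAX)) {
        return {true, 0, static_cast<double>(u)};
    }
    return {false, static_cast<std::int64_t>(u), 0.0};
}
```

As with json_decode(), the float path loses precision for values that are not exactly representable as a double.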
Related to #65
simdjson_is_valid(): bool and other PHP bindings would unpredictably be false if the OS was out of memory
E.g. use zend_error_noreturn to terminate the process
https://pecl.php.net/package/simdjson
pecl install simdjson
Note that the Windows PHP PECL build host on windows.php.net was down the last time I checked, so there are no dlls
E.g. if there's a long-running CLI application that calls simdjson_decode() once on startup, for a 100 megabyte long string, then avoid keeping around the 100 megabyte buffer that was allocated for the C simdjson parser after it is finished being used.
This would make it easier to reason about the worst-case impact of simdjson on PHP's memory usage.
If a buffer would be longer than a certain threshold (e.g. 100KB), then use a different short-lived instance (so that the same parser/buffer can be reused when calling simdjson again on small json blobs)
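The threshold rule could be expressed as a small decision function. This is a sketch only: `choose_parser` and `parser_choice` are hypothetical names, and 100 * 1024 is one reading of the "100KB" example above.

```cpp
#include <cstddef>

// Example threshold from the text above (assumed to mean 100 * 1024 bytes).
constexpr std::size_t REUSE_THRESHOLD_BYTES = 100 * 1024;

enum class parser_choice {
    shared,     // long-lived parser whose buffers are kept between calls
    temporary,  // short-lived parser that is freed right after the parse
};

// Small documents reuse the shared parser; large ones get a throwaway
// instance so their big buffers are not retained.
constexpr parser_choice choose_parser(std::size_t json_len,
                                      std::size_t threshold = REUSE_THRESHOLD_BYTES) {
    return json_len > threshold ? parser_choice::temporary : parser_choice::shared;
}
```

This bounds the memory held by the shared parser to roughly what the threshold-sized document needs, whatever else the process parses.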
Update PHP_SIMDJSON_VERSION
Related to #26
See https://github.com/TkTech/pysimdjson/tree/master/jsonexamples/test_parsing - y_* files are valid json, n_* files are invalid json, and i_*.json are implementation defined (I think)
Ideally, if simdjson_decode returns a valid result instead of throwing, then json_decode should return the exact same result (e.g. for parsing floats)
I see build-packagexml.php but no package.xml.
We use version 0.4.7, so maybe more adjustments are needed. Versions 0.5.0 and 0.6.0 bring new features. See the release page for more details.
E.g. return an array with fields type and count or null if the key does not exist
Similar to how json_decode throws https://www.php.net/jsonexception with JSON_THROW_ON_ERROR
JsonException can't be used because the error type is different
code contains the error type, for possible values see json_last_error().
In addition to that, the json module is optional before php 8.0
Set codes and expose them as constants (and translate values if the simdjson implementation later changes)
This would help in writing simpler error handling in user try-catch blocks that call simdjson_decode then process the data
Continue to throw a subclass of RuntimeException for compatibility with older releases, and because this happens at runtime based on the value
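Translating library error values into fixed public codes could be sketched as follows. All names and numeric values here are hypothetical: the internal enum stands in for whatever the bundled simdjson version defines, and the public constants are illustrative, not the extension's actual constants.

```cpp
// Stand-in for the library's error enum; its numeric values may change
// between simdjson releases, which is why they are not exposed directly.
enum class internal_error { SUCCESS, CAPACITY, MEMALLOC, DEPTH_ERROR, OTHER };

// Hypothetical stable codes that would be exposed to PHP as constants.
constexpr int PUBLIC_ERR_NONE = 0;
constexpr int PUBLIC_ERR_CAPACITY = 1;
constexpr int PUBLIC_ERR_MEMALLOC = 2;
constexpr int PUBLIC_ERR_DEPTH = 3;
constexpr int PUBLIC_ERR_UNKNOWN = 255;

// Explicit translation keeps the public codes fixed even if the library
// renumbers its enum.
constexpr int to_public_code(internal_error e) {
    switch (e) {
        case internal_error::SUCCESS:     return PUBLIC_ERR_NONE;
        case internal_error::CAPACITY:    return PUBLIC_ERR_CAPACITY;
        case internal_error::MEMALLOC:    return PUBLIC_ERR_MEMALLOC;
        case internal_error::DEPTH_ERROR: return PUBLIC_ERR_DEPTH;
        default:                          return PUBLIC_ERR_UNKNOWN;
    }
}
```

User code in a try-catch block can then compare the exception code against a named constant instead of matching on the message.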
It should be possible to use circleci or drone to set up tests. At a minimum, one should check that the install process works.
With display_errors=stderr and error_reporting=E_ALL, there is still no reported error.
simdjson should probably check the error result for errors specifically caused by recursion depth and emit a notice
php > var_export(simdjson_is_valid(str_repeat('[', 1000) . str_repeat(']', 1000)));
true
php > var_export(simdjson_is_valid(str_repeat('[', 1050) . str_repeat(']', 1050)));
false
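The two results above differ only in nesting depth, which a caller cannot tell from a plain `false`. A sketch of reporting "too deep" separately from "malformed" follows. It only tracks square brackets and is not a JSON validator; `check_nesting` and `bracket_check` are hypothetical names, and the depth limit is a parameter, not the extension's default.

```cpp
#include <cstddef>
#include <string_view>

enum class bracket_check { ok, depth_exceeded, unbalanced };

// Tracks only '[' and ']' to show how a depth failure can be distinguished
// from other failures. Brackets inside JSON strings are not handled.
constexpr bracket_check check_nesting(std::string_view s, std::size_t max_depth) {
    std::size_t depth = 0;
    for (char c : s) {
        if (c == '[') {
            if (++depth > max_depth) { return bracket_check::depth_exceeded; }
        } else if (c == ']') {
            if (depth == 0) { return bracket_check::unbalanced; }
            --depth;
        }
    }
    return depth == 0 ? bracket_check::ok : bracket_check::unbalanced;
}
```

A binding that sees the depth-specific result can emit a notice, while other failures keep returning false silently.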
struct open_container in the simdjson implementation means that 8+1 bytes will be used for a given depth limit, so the alternative of setting depth to json_length / 2 may (1) overflow size_t on 32-bit implementations, and (2) run out of memory
It fixes a few potential buffer overruns for adversarial inputs. It should be safer.
(instead of throwing SimdJsonException for the capacity error)
simdjson_decode is meant to have indistinguishable behavior from json_decode apart from error codes/classes/messages, so this should be safe
SIMDJSON_ERR* error code
Libraries using simdjson_decode may be forced to add extra error handling if they expect to be used for JSON > 4 GiB by any application using them. It'd be more convenient to avoid that.
Low priority compared to other tasks due to most php installations having max upload sizes
When the depth is both larger than the string length (guaranteeing the depth limit won't be hit) and larger than a cutoff such as 100000, ignore it and use a smaller value, such as one >= the string length
The fact that memory is allocated for depth is different from json_decode, so users might not expect the high memory usage when attempting to avoid depth errors
memory_limit settings in a conventional way and properly cause a php fatal error for invalid requests, e.g. an application receiving too much json
https://github.com/simdjson/simdjson/issues/1017 does not allow customizing the allocator yet
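The depth clamp proposed above can be written as a pure function of the requested depth and the input length. This is a sketch under the stated rule; `effective_depth` is a hypothetical helper, and 100000 is the example cutoff from the text, not a tuned value.

```cpp
#include <cstddef>

// Example cutoff from the text above.
constexpr std::size_t DEPTH_CUTOFF = 100000;

// If the requested depth exceeds both the input length and the cutoff, it can
// never be reached, so use a smaller value that is still >= the input length.
// Otherwise the requested depth is used unchanged.
constexpr std::size_t effective_depth(std::size_t requested, std::size_t json_len) {
    if (requested > json_len && requested > DEPTH_CUTOFF) {
        return json_len > DEPTH_CUTOFF ? json_len : DEPTH_CUTOFF;
    }
    return requested;
}
```

Because the clamped value is never below the input length, a document that parses with the requested depth still parses with the clamped one, while the depth-related buffers stay proportional to the input.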
It looks like the depth argument is missing in the ZEND_BEGIN_ARG_INFO for the functions simdjson_key_value, simdjson_key_exists and simdjson_key_count.
Should this be added, or why is it not defined like for simdjson_decode_arginfo?
/** ...
* @param realloc_if_needed Whether to reallocate and enlarge the JSON buffer to add padding.
* @return An element pointing at the root of the document, or an error:
* - MEMALLOC if realloc_if_needed is true or the parser does not have enough capacity,
* and memory allocation fails.
* - CAPACITY if the parser does not have enough capacity and len > max_capacity.
* - other json errors if parsing fails. You should not rely on these errors to always the same for the
* same document: they may vary under runtime dispatch (so they may vary depending on your system and hardware).
*/
inline simdjson_result<element> parse(const uint8_t *buf, size_t len, bool realloc_if_needed = true) & noexcept;
If the capacity is at least 64 bytes larger than the string size, it's safe to avoid the extra copy and set realloc_if_needed=false, as long as the 64 bytes are initialized memory (any value) (not sure which architectures this matters for)
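The condition above reduces to a size comparison. A minimal sketch, assuming the caller knows both the string length and the allocated capacity of its buffer; `can_parse_in_place` is a hypothetical helper and the 64-byte figure is taken from the text, not read from the library headers.

```cpp
#include <cstddef>

// Padding requirement as stated in the text above.
constexpr std::size_t PADDING_BYTES = 64;

// True when the buffer already has enough initialized slack after the JSON
// text, so realloc_if_needed=false would avoid the extra copy.
// Written as a subtraction to avoid overflow in len + PADDING_BYTES.
constexpr bool can_parse_in_place(std::size_t len, std::size_t capacity) {
    return capacity >= len && capacity - len >= PADDING_BYTES;
}
```

When this returns false, the safe default is to keep realloc_if_needed=true and let the parser copy the input into a padded buffer.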
Similar to json_decode, it can output either an object or an array
Highlights
New features:
Performance:
System support:
CircleCI was enabled this year, but fails because there's no configuration file
Can you adjust the decoder so that we can choose the data type? Reading 4-bit values is probably a lot faster than the default 64-bit, for example when computing embeddings. It would also compress better and keep the file size smaller.
from #5 (comment)
I cannot build the latest commit under Alpine Linux 3.10, getting these errors:
$ make
/bin/sh /root/simdjson_php/libtool --mode=compile cc -I. -I/root/simdjson_php -DPHP_ATOM_INC -I/root/simdjson_php/include -I/root/simdjson_php/main -I/root/simdjson_php -I/usr/include/php7 -I/usr/include/php7/main -I/usr/include/php7/TSRM -I/usr/include/php7/Zend -I/usr/include/php7/ext -I/usr/include/php7/ext/date/lib -DHAVE_CONFIG_H -g -O2 -c /root/simdjson_php/simdjson.c -o simdjson.lo
mkdir .libs
cc -I. -I/root/simdjson_php -DPHP_ATOM_INC -I/root/simdjson_php/include -I/root/simdjson_php/main -I/root/simdjson_php -I/usr/include/php7 -I/usr/include/php7/main -I/usr/include/php7/TSRM -I/usr/include/php7/Zend -I/usr/include/php7/ext -I/usr/include/php7/ext/date/lib -DHAVE_CONFIG_H -g -O2 -c /root/simdjson_php/simdjson.c -fPIC -DPIC -o .libs/simdjson.o
In file included from /root/simdjson_php/simdjson.c:23:
/root/simdjson_php/src/bindings.h:25:1: error: unknown type name 'namespace'; did you mean 'isspace'?
namespace simdjsonphp {
^~~~~~~~~
isspace
/root/simdjson_php/src/bindings.h:25:23: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
namespace simdjsonphp {
^
/root/simdjson_php/src/bindings.h:41:1: error: unknown type name 'bool'; did you mean '_Bool'?
bool cplus_simdjson_isvalid(const char *json);
^~~~
_Bool
/root/simdjson_php/simdjson.c:56:22: error: conflicting types for 'cplus_simdjson_isvalid'
extern unsigned char cplus_simdjson_isvalid(const char *json);
^~~~~~~~~~~~~~~~~~~~~~
In file included from /root/simdjson_php/simdjson.c:23:
/root/simdjson_php/src/bindings.h:41:6: note: previous declaration of 'cplus_simdjson_isvalid' was here
bool cplus_simdjson_isvalid(const char *json);
^~~~~~~~~~~~~~~~~~~~~~
make: *** [Makefile:193: simdjson.lo] Error 1
Commit a0e942d from May 29th builds fine.
If you want to try to reproduce, I started from a clean alpine:3.10 Docker container and only needed to install these packages: apk add git build-base php7-dev
Could you create an RFC to maybe implement this in PHP? The speed improvement is impressive and would improve PHP itself.
// namespace goes here
/**
* JSON decoder
*/
final class Decoder
{
    private static $simdjsonEnabled = false;

    public static function setAllowSimdjson(bool $allow): void
    {
        self::$simdjsonEnabled = $allow && self::hasWorkingSimdjson();
    }

    public static function hasWorkingSimdjson(): bool
    {
        return version_compare(phpversion('simdjson') ?: '0', '2.1.0', '>=') &&
            class_exists('SimdJsonException');
    }

    /**
     * JSON decodes the given json string $msg
     *
     * @throws JsonException
     */
    public static function jsonDecode(string $msg, bool $associative = false)
    {
        if (self::$simdjsonEnabled) {
            try {
                return simdjson_decode($msg, $associative);
            } catch (\Exception $e) {
                // Assume errors are rare and fall through, to return the exact same Error messages
            }
        }
        $decoded = \json_decode($msg, $associative);
        if (\json_last_error() !== \JSON_ERROR_NONE) {
            throw new \JsonException(\json_last_error_msg(), \json_last_error()); // requires symfony/polyfill-php73 before php 7.3
        }
        return $decoded;
    }
}
Decoder::setAllowSimdjson(true);
use ...\Decoder;
var_export(Decoder::jsonDecode('{"a":"b"}', true));
https://github.com/crazyxman/simdjson_php/tree/master/benchmark - for applications that work with long json strings, the extra method call would be worth it