halaxa / json-machine

Efficient, easy-to-use, and fast PHP JSON stream parser

License: Apache License 2.0

php stream-processing json-parser json-iterator json-stream parsing

json-machine's Introduction

JSON Machine

A very easy-to-use and memory-efficient drop-in replacement for inefficient iteration of big JSON files or streams, for PHP >= 7.0. See TL;DR. No dependencies in production except optional ext-json. The README is kept in sync with the code.




TL;DR

<?php

use \JsonMachine\Items;

// this often causes Allowed Memory Size Exhausted
- $users = json_decode(file_get_contents('500MB-users.json'));

// this usually takes few kB of memory no matter the file size
+ $users = Items::fromFile('500MB-users.json');

foreach ($users as $id => $user) {
    // just process $user as usual
    var_dump($user->name);
}

Random access like $users[42] is not yet possible. Use the above-mentioned foreach to find the item, or use a JSON Pointer.
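
For example, a sketch of finding one item with foreach (the file name is a placeholder):

<?php

use \JsonMachine\Items;

foreach (Items::fromFile('500MB-users.json') as $id => $user) {
    if ($id == 42) {
        // found it; stop streaming the rest of the file
        break;
    }
}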

Count the items via iterator_count($users). Bear in mind that it still has to iterate the whole document internally to get the count, so it takes about as long as iterating it yourself.
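
For instance (a sketch; the file name is a placeholder):

<?php

use \JsonMachine\Items;

$users = Items::fromFile('500MB-users.json');
// streams through the whole file internally just to produce the count
echo iterator_count($users);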

Requires ext-json if used out of the box. See Decoders.

Follow CHANGELOG.

Introduction

JSON Machine is an efficient, easy-to-use and fast JSON stream/pull/incremental/lazy (whatever you call it) parser based on generators, developed for unpredictably long JSON streams or documents. Its main features are:

  • Constant memory footprint for unpredictably large JSON documents.
  • Ease of use. Just iterate JSON of any size with foreach. No events and callbacks.
  • Efficient iteration on any subtree of the document, specified by JSON Pointer.
  • Speed. Performance-critical code contains no unnecessary function calls or regular expressions, and uses native json_decode to decode JSON document items by default. See Decoders.
  • Parses not only streams but any iterable that produces JSON chunks.
  • Thoroughly tested. More than 200 tests and 1000 assertions.

Parsing JSON documents

Parsing a document

Let's say that fruits.json contains this huge JSON document:

// fruits.json
{
    "apple": {
        "color": "red"
    },
    "pear": {
        "color": "yellow"
    }
}

It can be parsed this way:

<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruits.json');

foreach ($fruits as $name => $data) {
    // 1st iteration: $name === "apple" and $data->color === "red"
    // 2nd iteration: $name === "pear" and $data->color === "yellow"
}

Parsing a JSON array instead of a JSON object follows the same logic. The key in a foreach will be the numeric index of the item.

If you prefer JSON Machine to return arrays instead of objects, use new ExtJsonDecoder(true) as the decoder:

<?php

use JsonMachine\JsonDecoder\ExtJsonDecoder;
use JsonMachine\Items;

$objects = Items::fromFile('path/to.json', ['decoder' => new ExtJsonDecoder(true)]); // items are now decoded as associative arrays

Parsing a subtree

If you want to iterate only the results subtree in this fruits.json:

// fruits.json
{
    "results": {
        "apple": {
            "color": "red"
        },
        "pear": {
            "color": "yellow"
        }
    }
}

use the JSON Pointer /results as the pointer option:

<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruits.json', ['pointer' => '/results']);
foreach ($fruits as $name => $data) {
    // The same as above, which means:
    // 1st iteration: $name === "apple" and $data->color === "red"
    // 2nd iteration: $name === "pear" and $data->color === "yellow"
}

Note:

The value of results is not loaded into memory all at once; only one item in results is held in memory at a time. At the level/subtree you are currently iterating, there is always just one item in memory. Thus, the memory consumption is constant.

Parsing nested values in arrays

The JSON Pointer spec also allows using a hyphen (-) instead of a specific array index. JSON Machine interprets it as a wildcard that matches any array index (not any object key). This enables you to iterate nested values in arrays without loading the whole item.

Example:

// fruitsArray.json
{
    "results": [
        {
            "name": "apple",
            "color": "red"
        },
        {
            "name": "pear",
            "color": "yellow"
        }
    ]
}

To iterate over all colors of the fruits, use the JSON Pointer "/results/-/color".

<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruitsArray.json', ['pointer' => '/results/-/color']);

foreach ($fruits as $key => $value) {
    // 1st iteration:
    $key == 'color';
    $value == 'red';
    $fruits->getMatchedJsonPointer() == '/results/-/color';
    $fruits->getCurrentJsonPointer() == '/results/0/color';

    // 2nd iteration:
    $key == 'color';
    $value == 'yellow';
    $fruits->getMatchedJsonPointer() == '/results/-/color';
    $fruits->getCurrentJsonPointer() == '/results/1/color';
}

Parsing a single scalar value

You can parse a single scalar value anywhere in the document the same way as a collection. Consider this example:

// fruits.json
{
    "lastModified": "2012-12-12",
    "apple": {
        "color": "red"
    },
    "pear": {
        "color": "yellow"
    },
    // ... gigabytes follow ...
}

Get the scalar value of the lastModified key like this:

<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruits.json', ['pointer' => '/lastModified']);
foreach ($fruits as $key => $value) {
    // 1st and final iteration:
    // $key === 'lastModified'
    // $value === '2012-12-12'
}

When the parser finds the value and yields it back to you, it stops parsing. So when a single scalar value sits at the beginning of a gigabytes-sized file or stream, it simply returns the value from the beginning in no time, with almost no memory consumed.

The obvious shortcut is:

<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruits.json', ['pointer' => '/lastModified']);
$lastModified = iterator_to_array($fruits)['lastModified'];

Single scalar value access supports array indices in JSON Pointer as well.
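
For example, a sketch with an inline document, addressing a scalar inside an array item by its index:

<?php

use \JsonMachine\Items;

$json = '[{"name": "apple"}, {"name": "pear"}]';
$items = Items::fromString($json, ['pointer' => '/1/name']);

// single iteration: $key === 'name', $value === 'pear'
$name = iterator_to_array($items)['name'];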

Parsing multiple subtrees

It is also possible to parse multiple subtrees using multiple JSON Pointers. Consider this example:

// fruits.json
{
    "lastModified": "2012-12-12",
    "berries": [
        {
          "name": "strawberry", // not a berry, but whatever ...
          "color": "red"
        },
        {
          "name": "raspberry", // the same ...
          "color": "red"
        }
    ],
    "citruses": [
      {
          "name": "orange",
          "color": "orange"
      },
      {
          "name": "lime",
          "color": "green"
      }
    ]
}

To iterate over all berries and citruses, use the JSON Pointers ["/berries", "/citruses"]. The order of the pointers does not matter. The items will be iterated in the order of their appearance in the document.

<?php

use \JsonMachine\Items;

$fruits = Items::fromFile('fruits.json', [
    'pointer' => ['/berries', '/citruses']
]);

foreach ($fruits as $key => $value) {
    // 1st iteration:
    $value == ["name" => "strawberry", "color" => "red"];
    $fruits->getCurrentJsonPointer() == '/berries';

    // 2nd iteration:
    $value == ["name" => "raspberry", "color" => "red"];
    $fruits->getCurrentJsonPointer() == '/berries';

    // 3rd iteration:
    $value == ["name" => "orange", "color" => "orange"];
    $fruits->getCurrentJsonPointer() == '/citruses';

    // 4th iteration:
    $value == ["name" => "lime", "color" => "green"];
    $fruits->getCurrentJsonPointer() == '/citruses';
}

What is JSON Pointer anyway?

It's a way of addressing one item in a JSON document. See the JSON Pointer RFC 6901. It's very handy, because sometimes the JSON structure goes deeper and you want to iterate a subtree, not the main level. So you just specify the pointer to the JSON array or object (or even to a scalar value) you want to iterate and off you go. When the parser hits the collection you specified, iteration begins. You pass it as the pointer option to any of the Items::from* functions. If you specify a pointer to a non-existent position in the document, an exception is thrown. It can be used to access scalar values as well. A JSON Pointer itself must be a valid JSON string. Literal comparison of reference tokens (the parts between slashes) is performed against the JSON document keys/member names.

Some examples:

JSON Pointer value                    Will iterate through
"" (empty string - the default)       ["this", "array"] or {"a": "this", "b": "object"} (the main level)
/result/items                         {"result": {"items": ["this", "array", "will", "be", "iterated"]}}
/0/items                              [{"items": ["this", "array", "will", "be", "iterated"]}] (supports array indices)
/results/-/status                     {"results": [{"status": "iterated"}, {"status": "also iterated"}]} (a hyphen as an array index wildcard)
/ (gotcha! - a slash followed by      {"": ["this", "array", "will", "be", "iterated"]}
  an empty string, see the spec)
/quotes\"                             {"quotes\"": ["this", "array", "will", "be", "iterated"]}

Options

Options may change how the JSON is parsed. An array of options is the second parameter of all Items::from* functions. Available options are:

  • pointer - A JSON Pointer string (or an array of them) that tells which part of the document you want to iterate.
  • decoder - An instance of the ItemDecoder interface.
  • debug - true or false to enable or disable debug mode. When debug mode is enabled, data such as line, column and position in the document are available during parsing or in exceptions. Keeping debug disabled gives a slight performance advantage.
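
The options can be combined. A sketch (the file path is a placeholder):

<?php

use JsonMachine\Items;
use JsonMachine\JsonDecoder\ExtJsonDecoder;

$items = Items::fromFile('path/to.json', [
    'pointer' => '/results',               // iterate only the results subtree
    'decoder' => new ExtJsonDecoder(true), // decode items as associative arrays
    'debug'   => true,                     // track line/column/position while parsing
]);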

Parsing streaming responses from a JSON API

A streaming API response or any other JSON stream is parsed exactly the same way as a file. The only difference is that you use Items::fromStream($streamResource), where $streamResource is the stream resource with the JSON document. The rest is the same as with parsing files. Here are some examples of popular HTTP clients which support streaming responses:

GuzzleHttp

Guzzle uses its own streams, but they can be converted back to PHP streams by calling \GuzzleHttp\Psr7\StreamWrapper::getResource(). Pass the result of this function to the Items::fromStream function, and you're set up. See the working GuzzleHttp example.
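
A minimal sketch of that setup (the URL and the /results pointer are placeholders):

<?php

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\StreamWrapper;
use JsonMachine\Items;

$client = new Client();
// 'stream' => true tells Guzzle not to buffer the whole body in memory
$response = $client->request('GET', 'https://api.example.com/users.json', ['stream' => true]);

// convert the Guzzle stream to a plain PHP stream resource
$phpStream = StreamWrapper::getResource($response->getBody());

foreach (Items::fromStream($phpStream, ['pointer' => '/results']) as $user) {
    // one $user at a time, no matter how big the response is
}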

Symfony HttpClient

A streamed response of Symfony HttpClient works as an iterator. And because JSON Machine is based on iterators, the integration with Symfony HttpClient is very simple. See the HttpClient example.
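
A minimal sketch of that integration, assuming a placeholder URL and feeding the response chunks to Items::fromIterable:

<?php

use JsonMachine\Items;
use Symfony\Component\HttpClient\HttpClient;

$client = HttpClient::create();
$response = $client->request('GET', 'https://api.example.com/users.json');

// generator yielding the raw body chunks of the streamed response
$chunks = (function () use ($client, $response) {
    foreach ($client->stream($response) as $chunk) {
        yield $chunk->getContent();
    }
})();

foreach (Items::fromIterable($chunks, ['pointer' => '/results']) as $user) {
    // process one user at a time while the download is still in progress
}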

Tracking the progress (with debug enabled)

Big documents may take a while to parse. Call Items::getPosition() in your foreach to get the current number of processed bytes from the beginning of the document. The percentage is then easy to calculate as position / total * 100. To find out the total size of your document in bytes you may want to check:

  • strlen($document) if you parse a string
  • filesize($file) if you parse a file
  • the Content-Length HTTP header if you parse an HTTP stream response
  • ... you get the point

If debug is disabled, getPosition() always returns 0.

<?php

use JsonMachine\Items;

$fileSize = filesize('fruits.json');
$fruits = Items::fromFile('fruits.json', ['debug' => true]);
foreach ($fruits as $name => $data) {
    echo 'Progress: ' . intval($fruits->getPosition() / $fileSize * 100) . " %\n";
}

Decoders

Items::from* functions also accept the decoder option. It must be an instance of JsonMachine\JsonDecoder\ItemDecoder. If none is specified, ExtJsonDecoder is used by default. It requires the ext-json PHP extension to be present, because it uses json_decode. When json_decode doesn't do what you want, implement JsonMachine\JsonDecoder\ItemDecoder and make your own.

Available decoders

  • ExtJsonDecoder - Default. Uses json_decode to decode keys and values. Its constructor takes the same parameters as json_decode.

  • PassThruDecoder - Does no decoding. Both keys and values are produced as pure JSON strings. Useful when you want to parse a JSON item with something else directly in the foreach and don't want to implement JsonMachine\JsonDecoder\ItemDecoder. Since 1.0.0 it does not use json_decode.

Example:

<?php

use JsonMachine\JsonDecoder\PassThruDecoder;
use JsonMachine\Items;

$items = Items::fromFile('path/to.json', ['decoder' => new PassThruDecoder]);

  • ErrorWrappingDecoder - A decorator which wraps decoding errors inside a DecodingError object, thus enabling you to skip malformed items instead of dying on a SyntaxError exception. Example:

<?php

use JsonMachine\Items;
use JsonMachine\JsonDecoder\DecodingError;
use JsonMachine\JsonDecoder\ErrorWrappingDecoder;
use JsonMachine\JsonDecoder\ExtJsonDecoder;

$items = Items::fromFile('path/to.json', ['decoder' => new ErrorWrappingDecoder(new ExtJsonDecoder())]);
foreach ($items as $key => $item) {
    if ($key instanceof DecodingError || $item instanceof DecodingError) {
        // handle error of this malformed json item
        continue;
    }
    var_dump($key, $item);
}

Error handling

Since 0.4.0 every exception extends JsonMachineException, so you can catch that to filter out any error from the JSON Machine library.

Skipping malformed items

If there's an error anywhere in a JSON stream, a SyntaxError exception is thrown. That's very inconvenient, because one malformed item makes it impossible to parse the rest of the document. ErrorWrappingDecoder is a decoder decorator which can help you with that. Wrap a decoder in it, and all malformed items you are iterating will be handed to you in the foreach as DecodingError objects. This way you can skip them and continue with the rest of the document. See the example in Available decoders. Syntax errors in the structure of the JSON stream between the iterated items will still throw a SyntaxError exception, though.

Parser efficiency

The time complexity is always O(n).

Streams / files

TL;DR: The memory complexity is O(2), i.e. constant.

JSON Machine reads a stream (or a file) one JSON item at a time and generates the corresponding PHP item one at a time. This is the most efficient way, because if you had, say, 10,000 users in a JSON file and wanted to parse it using json_decode(file_get_contents('big.json')), you'd have the whole string in memory as well as all 10,000 PHP structures. The following table shows the difference:

                  String items in memory at a time    Decoded PHP items in memory at a time    Total
json_decode()     10000                               10000                                    20000
Items::from*()    1                                   1                                        2

This means that JSON Machine is consistently efficient for any size of processed JSON. 100 GB? No problem.

In-memory JSON strings

TL;DR: The memory complexity is O(n+1), i.e. linear.

There is also the method Items::fromString(). If you are forced to parse a big string and a stream is not available, JSON Machine may still be better than json_decode. The reason is that, unlike json_decode, JSON Machine traverses the JSON string one item at a time and doesn't load all the resulting PHP structures into memory at once.
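
A sketch (the file name is a placeholder; the string is assumed to already be in memory):

<?php

use JsonMachine\Items;

$jsonString = file_get_contents('big.json'); // the whole string sits in memory...

foreach (Items::fromString($jsonString) as $id => $user) {
    // ...but only one decoded PHP item exists at a time
}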

Let's continue with the example of 10,000 users. This time they are all in a string in memory. When decoding that string with json_decode, 10,000 arrays (objects) are created in memory and then the result is returned. JSON Machine, on the other hand, creates a single structure for each found item in the string and yields it back to you. When you process this item and iterate to the next one, another single structure is created. This is the same behaviour as with streams/files. The following table puts the concept into perspective:

                      String items in memory at a time    Decoded PHP items in memory at a time    Total
json_decode()         10000                               10000                                    20000
Items::fromString()   10000                               1                                        10001

The reality is even better: Items::fromString consumes about 5x less memory than json_decode, because a PHP structure takes much more memory than its corresponding JSON representation.

Troubleshooting

"I'm still getting Allowed memory size ... exhausted"

One of the reasons may be that the items you want to iterate over are under some sub-key such as "results", but you forgot to specify a JSON Pointer. See Parsing a subtree.

"That didn't help"

Another reason may be that one of the items you iterate is itself so huge it cannot be decoded at once. For example, you iterate over users and one of them has thousands of "friend" objects in it. Use PassThruDecoder, which does not decode the item, get the JSON string of the user, and parse it iteratively yourself using Items::fromString().

<?php

use JsonMachine\Items;
use JsonMachine\JsonDecoder\PassThruDecoder;

$users = Items::fromFile('users.json', ['decoder' => new PassThruDecoder]);
foreach ($users as $user) {
    foreach (Items::fromString($user, ['pointer' => "/friends"]) as $friend) {
        // process friends one by one
    }
}

"I am still out of luck"

It probably means that the JSON string $user itself or one of the friends is too big to fit in memory. However, you can apply this approach recursively: parse "/friends" with PassThruDecoder, getting one $friend JSON string at a time, and then parse that using Items::fromString()... If even that does not help, there's probably no solution yet via JSON Machine. A feature is planned which will enable you to iterate any structure fully recursively, with strings served as streams.
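
A sketch of that recursive idea (the file name and keys are placeholders):

<?php

use JsonMachine\Items;
use JsonMachine\JsonDecoder\PassThruDecoder;

$users = Items::fromFile('users.json', ['decoder' => new PassThruDecoder]);
foreach ($users as $user) {
    // $user is a raw JSON string; keep the friends undecoded as well
    $friends = Items::fromString($user, [
        'pointer' => '/friends',
        'decoder' => new PassThruDecoder,
    ]);
    foreach ($friends as $friend) {
        // $friend is a smaller raw JSON string: decode it, or recurse the same way
        var_dump(json_decode($friend));
    }
}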

Installation

Using Composer

composer require halaxa/json-machine

Without Composer

Clone or download this repository and add the following to your bootstrap file:

spl_autoload_register(require '/path/to/json-machine/src/autoloader.php');

Development

Clone this repository. This library supports two development approaches:

  1. non-containerized (PHP and Composer already installed on your machine)
  2. containerized (Docker on your machine)

Non-containerized

Run composer run -l in the project dir to see the available dev scripts. This way you can run individual steps of the build process, such as tests.

Containerized

Install Docker and run make in the project dir on your host machine to see available dev tools/commands. You can run all the steps of the build process separately as well as the whole build process at once. Make basically runs composer dev scripts inside containers in the background.

make build: Runs the complete build. The same command is run via GitHub Actions CI.

Support

Do you like this library? Star it, share it, show it :) Issues and pull requests are very welcome.

ko-fi

License

Apache 2.0

Cogwheel element: Icons made by TutsPlus from www.flaticon.com are licensed under CC 3.0 BY.

Table of contents generated with markdown-toc

json-machine's People

Contributors

a-sync, cerbero90, chrysanthos, formatz, fwolfsjaeger, gabimem, halaxa, laravelfreelancernl, mavik, nbish11, snapshotpl, szepeviktor, xedinunknown

json-machine's Issues

Memory error too

Hi,

I'm sorry for also reporting, but I too get a memory error.

My source file is 5.6 GB, and I've put it up for download here (the download is 900 MB).

My PHP code is just:

<?php

use JsonMachine\JsonMachine;

require('vendor/autoload.php');

$maxlen = [
	'url' => 0,
	'title' => 0,
	'markdownbody' => 0,
];
$counter = 0;
foreach (JsonMachine::fromFile('23-12-2-sites.json') as $item) {
	foreach ($maxlen as $stat => $max) {
		$maxlen[$stat] = max($max, strlen($item[$stat]));
	}
	$counter++;
	echo "item $counter done\n";
}
echo "Found $counter elements\n";
var_dump($maxlen);

Error:

...
item 3850 done

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 60817416 bytes) in D:\...\vendor\halaxa\json-machine\src\Lexer.php on line 57

Call Stack:
    0.0004     406064   1. {main}() D:\...\script.php:0
  320.8091   23634624   2. JsonMachine\Parser->getIterator() D:\...\script.php:24
  320.8148    8966824   3. JsonMachine\Lexer->getIterator() D:\...\vendor\halaxa\json-machine\src\Parser.php:102

The structure is quite simple; it's just an array of basic objects.

It is in theory possible that the JSON is not valid, I suppose, since it's been generated by someone else's script. However, no parsing exception was thrown, possibly because the out-of-memory error hit first.

Could you take a look if I've done something silly? Thanks!

Creating an array from items at the same level

In my use case I want to get all results at a certain nth depth.

 "rest": [
    {
      "mode": "server",
      "resource": [
          "type": "AllergyIntolerance",
...]
      "resource": [
          "type": "MedicationStatement",
...]

So I want to return all resource types:
i.e.
["AllergyIntolerance","MedicationStatement"]

Is that possible?

Allowed memory size exhausted

Hello halaxa!

First of all, thank you for your work!
I would need, if possible, a bit of help or more examples of how to use this lib.
I have a JSON file (19.1 MB) and I'm trying to read it using

foreach (\JsonMachine\JsonMachine::fromFile(BASE_PATH . '/sm/seasons.json') as $key => $value) { var_dump([$key, $value]); }

but my host returns the message

PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /***/htdocs/vendor/halaxa/json-machine/src/Parser.php on line 177

Probably there is something that I don't understand correctly and I'm not using the library the right way.
Any help will be appreciated!

Thank you!

Re-iterate

Hello there,

  1. I have a problem where I need to iterate through the JSON multiple times. If I start the second iteration attempt I get the following error:

Fatal error: Uncaught JsonMachine\Exception\SyntaxErrorException: Cannot iterate empty JSON '' At position 0. in /var/www/vendor/halaxa/json-machine/src/Parser.php:368

  | Stack trace:
  | #0 /var/www/vendor/halaxa/json-machine/src/Parser.php(245): JsonMachine\Parser->error('Cannot iterate ...', NULL)
  | #1 /var/www/src/Workorder.php(101): JsonMachine\Parser->getIterator()
  | #2 /var/www/src/Workorder.php(87): App\Workorder->count(Object(JsonMachine\Items))
  | #3 /var/www/src/Workflow.php(20): App\Workorder->push()
  | #4 /var/www/public/index.php(11): App\Workflow->execute()
  | #5 {main}
  | thrown in /var/www/vendor/halaxa/json-machine/src/Parser.php on line 368

  2. Is there a way to remove items?

PHP Fatal error: Allowed memory size exhausted

PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /var/www/html/vendor/halaxa/json-machine/src/Parser.php on line 177

Json-machine version is 0.3.2

I am trying to read a file of about 27 MB.

Parsing a google poly dump json

So I'm trying to parse a very large JSON document that is effectively a one-dimensional series of elements (though with nested parameters belonging to each particular element).

https://drive.google.com/drive/folders/1PA9_Hq1Te7aBUoPurVdICKuCuY5Peka8?usp=sharing

Memory issues seem to happen intermittently. I'd really love to be able to just foreach this, but json-machine seems to die on it even after setting PHP to higher memory limits.

Adventures so far
https://gist.github.com/yosun/d1ef6ef56943bd2417b07f4970ff7447

Yield an error object instead of exception when possible

The idea is not to throw an exception if a parse error occurs inside the structure which is about to be yielded. Instead, some kind of parse-error object could be yielded, so the consumer can decide whether to stop iteration or just skip the erroneous structure and continue.

Usage:

foreach(JsonMachine::fromFile('a.json') as $key => $value) {
    if ($key instanceof JsonError || $value instanceof JsonError) {
        // continue / log / throw ...
    }
    // process $key, $value
}

Push parsing support

Support for incrementally feeding the parser via an explicit method call where the pull approach of foreach cannot be used. Useful, for example, for curl's CURLOPT_WRITEFUNCTION or when receiving JSON chunks in an event loop.

Proposed usage (implicit):

$items = new PushItems(['pointer' => '/results']);

$callback = function ($jsonChunk) use ($items) {
    $items->push($jsonChunk);
    foreach($items as $item) {
        // process currently available items
    }
}

or more explicit (similar to current API):

$queue = new QueueChunks();
$items = Items::fromQueue($queue, ['pointer' => '/results']);

$callback = function ($jsonChunk) use ($items, $queue) {
    $queue->push($jsonChunk);
    foreach($items as $item) {
        // process currently available items
    }
}

Any other proposal?

CLI script for streaming

@halaxa asked me to start a new thread stemming from the discussion at #73

Hopefully the script described below, which I'll call jm, will provide a basis for discussion or perhaps even the foundation of a script worthy of inclusion in the JSON Machine repository.

The --help option produces quite extensive documentation, so for now I'll just add two points:

  1. The script attempts to make installation as painless as possible for the uninitiated, and I hope that further enhancements in that regard will be added;
  2. Ideally, the script would preserve the precision of all JSON numbers (at least as the default) -- see #87

[ EDIT: The jm script is now in the "jm" repository. ]

Wildcard pointer doesn't work properly

Hello!
I have a problem using the wildcard pointer /results/-/color as in the example from the root documentation:
as a result I receive only the 1st iteration; the second is not available through foreach or the iterator.

$fruits = Items::fromFile('fruitsArray.json', ['pointer' => '/results/-/color']);

  • my code example below:
<?php 

$json = '{"results":[{"name":"apple","color":"red"},{"name":"pear","color":"yellow"},{"name":"some","color":"green"}]}';
$fruits = Items::fromString($json, ['pointer' => "/results/-/color"]);

var_dump(iterator_count($fruits->getIterator())); // result 1

foreach ($fruits as $key => $value) {
	echo "{$value}\n"; // result only red
}
  • I also tried running the package's tests and they pass
  • package version is 1.1.1

P.S: Your library is awesome)

total items

Hi,

How can I find out how many items the JSON document has?

Starts Fast, Gets Slow Towards End?

Never mind; someone (silly me) removed an index from our MySQL join column without letting anyone know, which caused everything to slow down. Fixed that, and the entire file now parses in 15 minutes. JSON Machine works great!

Json encode?

I'm currently trying to encode a large number of arrays as JSON.

Does this package only support json_decode? Is there any way to do the same for json_encode?

Not decoding JSON

I guess this sounds strange, but I need to process JSON files of GB in size... and I don't want to decode the nodes to arrays or objects. I just want the JSON!

Obviously I could json_encode the array that's produced, but with millions of transactions it's worrying to put them through an unnecessary step where an error could be introduced in the decode/encode process.

(The background is this: I have millions of user transactions to feed into a webhook. The webhook is expecting JSON formatted exactly in the way the nodes in the JSON blob are formatted. I just need to take each node, feed it to the webhook, check the response, and move onto the next one.)

Any options here?

Deprecated warning

Hello, can you tell me how to avoid this:

2022-05-31T21:19:57+00:00 [info] User Deprecated: Method "IteratorAggregate::getIterator()" might add "\Traversable" as a native return type declaration in the future. Do the same in implementation "JsonMachine\Items" now to avoid errors or add an explicit @return annotation to suppress this message.

Out of memory with Guzzle

A bit surprisingly, this allocates a lot of memory with large JSON files, in this example a 181 MB one (found here: https://github.com/zemirco/sf-city-lots-json/blob/master/citylots.json)

<?php

require_once __DIR__ . '/vendor/autoload.php';

$client = new \GuzzleHttp\Client();
$response = $client->request('GET', 'http://127.0.0.1:8001/storage/citylots.json');

// Gets PHP stream resource from Guzzle stream
$phpStream = \GuzzleHttp\Psr7\StreamWrapper::getResource($response->getBody());

foreach (\JsonMachine\JsonMachine::fromStream($phpStream) as $key => $value) {
  //
}
% php memory.php 
PHP Fatal error:  Allowed memory size of 268435456 bytes exhausted (tried to allocate 20480 bytes) in /tmp/test/vendor/halaxa/json-machine/src/Parser.php on line 177
PHP Fatal error:  Allowed memory size of 268435456 bytes exhausted (tried to allocate 20480 bytes) in /tmp/test/vendor/guzzlehttp/promises/src/TaskQueue.php on line 24

I'm 100% guessing it's because json-machine is not registering as the sink for Guzzle:
http://docs.guzzlephp.org/en/stable/request-options.html#sink

Basic Usage Using PHP and Brew

Hey!

This is going to be a really basic question. I downloaded the library last week using brew, but I'm having some issues getting it working. I've never used Composer before, and the import isn't working.

I've used the example, but it falls over when trying to use the Items::fromFile call.

The code looks like this, but I'm unsure how to fix it:

use JsonMachine\Items;
use JsonMachine\JsonDecoder\PassThruDecoder;

$users = Items::fromFile

Change default decoding structure from array to object

This will make it more predictable, as json_decode works the same way.
The only thing needed is to change the line with the default instantiation of ExtJsonDecoder in Parser.

Huge BC break - will wait for version 1.0.

Get JSON object properties (non-iterable elements)

I'm loading JSON from the FDA (example query) which contains meta values which are non-arrays. I would like to be able to retrieve the last_updated and total values from this JSON:

{
  "meta": {
    "terms": "https://open.fda.gov/terms/",
    "license": "https://open.fda.gov/license/",
    "last_updated": "2019-12-20",
    "results": {
      "skip": 0,
      "limit": 105845,
      "total": 105845
    }
  },
  "results": [

    "… 105845 records here …"

  ]
}

I've tried to get the values from meta like this, but it doesn't work because meta is not an iterable object:

$meta = \JsonMachine\JsonMachine::fromFile($import_file, '/meta');
foreach ($meta as $key => $val) {
    if ($key == 'last_updated') {
        $last_updated = $val;
    }
}

Is there a way to get these values using JsonMachine?

Edit: I also tried this:

$last_updated = \JsonMachine\JsonMachine::fromFile($import_file, '/meta/last_updated');
echo $last_updated;
echo $last_updated->current();
echo end($last_updated);

I'm not familiar with using IteratorAggregate, so I'm not sure how to get a value from the $last_updated object.

Parser fails with "Syntax error '[' At position N"

// works
// $json = '{"result":{"items":[]}}';

// throws the SyntaxError exception because of `"foo":[]`
$json = '{"result":{"foo":[], "items":[]}}'; 

$items = \JsonMachine\JsonMachine::fromString($json, "/result/items");
foreach ($items as $name => $data) {
        echo $data, "\n";
}

Library should not be tied to specific data reading protocols

This library should not have either of the methods fromFile() or fromStream(). Perhaps these could be kept around as a convenience, but a pure implementation should not have methods that deal with reading files or streams, because in so doing, the library only supports files and streams. If I wanted to use this library with a general-purpose async framework like Amp or React, I can't, because it doesn't explicitly support them. A well-designed implementation would not tie itself to particular data protocols, instead just accepting incomplete JSON fragments from a string buffer, like Duct.

Detect unexpected end of stream

This tool is quite useful for me to synchronize products in my ecommerce, but I need to be able to detect when a transmission ended unexpectedly, in order to cancel the operation.

For example, this is the format used:

{
	"status": "success",
	"data": [
		{"id": 1, "name": ...},
		{"id": 2, "name": ...},
		{"id": 3, "name": ...},
		{"id": 4, "name": ...},
		.
		.
		.
		{"id": N, "name": ...}
	]
}

But when the transmission ends unexpectedly, the format is truncated

{
	"status": "success",
	"data": [
		{"id": 1, "name": ...},
		{"id": 2, "name": ...},
		{"id": 3, "name": ...},
		{"id": 4, "name": ...},
		.
		.
		.
		{"id": X, "name": ...}

Note that sometimes the object of the last item is complete, so there is no format error.

When processing, this library ignores that unexpected ending. I need to be able to detect that the JSON document has not finished correctly.
Is this possible to do?
It would be ideal to throw an exception in that case.

Thank you

CLI support?

Please let me know by reactions/voting or comments if a CLI version of JSON Machine would be useful to have. Thanks.

The jm command would take a JSON stream from stdin and send items one by one to stdout, each wrapped in a single-item JSON object encoded as {key: value}.

Possible usage:

$ wget <big list of users to stdout> | jm --pointer=/results
{"0": {"name": "Frank Sinatra", ...}}
{"1": {"name": "Ray Charles", ...}}
...

Another idea might be to wrap the item in a JSON list instead of an object, like so:

$ wget <big list of users to stdout> | jm --pointer=/results
[0, {"name": "Frank Sinatra", ...}]
[1, {"name": "Ray Charles", ...}]
...

Performance improvements (lexer/parser)

Hi guys,

the memory usage is awesome, but the CPU time is ~100x that of json_decode (100 MB JSON with 10,000 entries).
Did you consider using a C extension for the tokenizing/parsing?
I've never written an extension, but it looks like we could extend ext-json
or even just use ext-parle for the heavy lifting.

I could try to implement a lexer with ext-parle, see how the performance changes, and then implement a parser if you guys think this is a good idea.

Greetings

Un parse error

Hello, I'm trying to parse a large JSON file and I get an error. Only the first object is parsed and then there's an error.

$response = JsonMachine::fromFile(storage_path('app/file.json'), '/products', new ErrorWrappingDecoder(new ExtJsonDecoder()));

This is the code I used (I also tried PassThruDecoder, and json_decode by myself), but it does not work because not all items have { at the beginning.

Attaching a test JSON file.

"identifiers": {} ---> this line cause error.

test.txt

Subtree continues to iterate after last item

Hi @halaxa

Is there a way to stop iterating the subtree? I have a JSON file of 500 GB with 10 subtrees. Right now the code continues iterating past the subtree and thus wastes a lot of time.

The problem is there is no way - with the current code base - to know how to break out of the for loop. I would argue that it would be most useful if the iteration stopped by itself instead of having to code a break yourself. What do you think?

Reference: #21

Getting uncatchable errors on non-JSON files

When calling fromFile on a file in the wrong format, I get some uncatchable errors.
I had this issue with both 0.7.0 and 1.1.1.

\JsonMachine\JsonMachine::fromFile($filename, "/myattribute");

I get the following fatal error with a zip file, but I can also have similar issues with text files.

message: Undefined variable $P
script: .../halaxa/json-machine/src/Parser.php
line: 115

Testing the vars before $tokenType = ${$token[0]}; seems to be helpful:

if($token == null || !isset($token[0])|| !isset(${$token[0]})) { throw new JsonMachineException("Error parsing stream."); }

Split Parser into two generators

  1. Rename Lexer to Tokens
  2. Split Parser into 2 parts:
    1. SyntaxCheckedTokens - Will iterate Tokens, only check the syntax of the tokens and yield them along
    2. PhpItems - Will iterate SyntaxCheckedTokens and yield PHP structures

This makes it possible to skip syntax checking (to gain speed where applicable) very easily. The user can simply remove SyntaxCheckedTokens from the generator stack.

It will also pave way to #36.

Option or method for preserving the precision of numeric literals

Currently, the precision of JSON numbers is not, in general, preserved.

Unfortunately, using JSON_BIGINT_AS_STRING, at least by itself, doesn't help: first because it converts all "big integers" to strings (see below), and second because it does nothing for "big decimals" (meaning "big or little decimals").

Perhaps asking for the preservation of numeric decimals is too much to ask for in this project; if so, please interpret this ER as a request for the preservation of integer precision.

--

print json_encode(json_decode("[\"123\", 123]", flags: JSON_BIGINT_AS_STRING))."\n";

print json_encode(json_decode("[\"123000000000000000000000000123\", 123000000000000000000000000123]", flags: JSON_BIGINT_AS_STRING))."\n";

produces:

["123",123]
["123000000000000000000000000123","123000000000000000000000000123"]

whereas for the second array, we want:

["123000000000000000000000000123",123000000000000000000000000123]

No download on PHP 5.6 :(

[InvalidArgumentException]
Package halaxa/json-machine at version has a PHP requirement incompatible
with your PHP version (5.3.28)

Composer error. Help me please.

My PHP version is 5.6.36.

PHP 5.5 support

@dizirator, I see you need support for PHP 5.5. Would you like to make a pull request to make it official in JSON Machine?

passing headers

Hello,

how can you send headers with the request?

$context_re = stream_context_create(array(
    'http' => array(
        'header'  => "Authorization: Basic " . base64_encode($user . ":" . $pass)
    )
));

$json = Items::fromFile($domain . "/exports/missions-published.json", ['debug' => true]);
print_r($json);

Still Memory Error

I tried with 187 MB of JSON data and boom! I still get "Allowed memory size of xxx bytes exhausted (tried to allocate xx bytes)".

All exceptions should extend one base exception

All exceptions in the JsonMachine\Exception namespace should extend one common exception, say JsonMachineException, so that userland code can catch that one type and thereby catch anything from this library. Feel free to create a pull request and participate :)

UnexpectedEndSyntaxErrorException

I sometimes get this error reading valid JSON files. Any idea why this could be happening?
If I load the file into any JSON validator, there is no error at ','. Sometimes the position is not ',' but another arbitrary part of the JSON file.

PHP Fatal error: Uncaught JsonMachine\Exception\UnexpectedEndSyntaxErrorException: JSON string ended unexpectedly ',' At position 0. in /.../.../.../.../.../vendor/halaxa/json-machine/src/Parser.php:368 Stack trace: #0 /.../.../.../.../.../vendor/halaxa/json-machine/src/Parser.php(249): JsonMachine\Parser->error() #1 /.../.../.../.../.../.../load.php(92): JsonMachine\Parser->getIterator()

taking time while parsing

foreach while parsing data is taking a lot of time on the live server, but not locally. Need help.
Here is my code:

$jsonFilePath = dirname(__FILE__) . "/cronJob.json";

foreach ($array as $value) {
    $compare[] = "/" . $value;
}
$array = $compare;
try {
    $jsonData = Items::fromFile($jsonFilePath, [
        'pointer' => $array
    ]);
} catch (\Throwable $e) {
    return $dataToFetch;
}

foreach ($jsonData as $key => $value) {
    if ($key == 'id') {
        $id = $value;
    }
    $dataToFetch[$id][$key] = $value;
}
$dataToFetch = array_values($dataToFetch);

return $dataToFetch;

Leave PHP 5

Update composer.json
Add scalar typehints
Update phpunit
...

Thank you for this awesome library.

I want to thank you from the core of my heart.

This library is just awesome, especially the ease of using decoders and pointers.

A very big thank you.

error when foreach

Hi
if i "$json = JsonMachine::fromFile($_FILES["file"]["tmp_name"]);" a large file and foreach to loop it, it crash

foreach ($json as $e) {
    $element = (object)$e;
}

If I add var_dump($element); inside the foreach loop, it works again. Why?

thanks
Peter

How about receiving an HTTP response as JSON?

I saw this code in the README:

// this often causes Allowed Memory Size Exhausted
- $users = json_decode(file_get_contents('500MB-users.json'));
// this usually takes few kB of memory no matter the file size
+ $users = \JsonMachine\JsonMachine::fromFile('500MB-users.json');

But what if the file is over the network, or perhaps an external HTTPS/HTTP call that returns valid JSON?
Is it applicable, or will it die as well?

Speed of \JsonMachine\JsonMachine::fromFile

Hi!

Is it normal for a response to take 8-9 seconds on a 32MB file? When using json_decode, it was maybe a second or two (at the cost of resources).

I am simply using
$array = \JsonMachine\JsonMachine::fromFile($file);
and matching an email with foreach to return all objects containing that email (usually 5-10 objects and 20kB or so).

Cheers
