henu / bigjson
Python library that reads JSON files of any size.
License: MIT License
Thanks a lot for your code.
Hi, I'm using your package to load the huge Wikidata file "latest-all.json", which is about 1 TB. The loading part works fine. However, the structure we get seems to differ from the original Wikidata structure. For instance, after getting the element Q42 with "element = j[4]", the code "element["claims"]["P31"]" gives me an array of Wikidata entities, whereas I expected an object for Q5 (Q42 is supposed to be connected to Q5 via P31).
I'm wondering whether the reading method follows the original JSON file structure exactly?
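This is most likely the Wikidata dump format itself rather than a bigjson problem: in the dump, each property under "claims" maps to an array of statements, and the target QID sits under mainsnak → datavalue → value → id. A minimal sketch of that layout (field values here are illustrative, modeled on the published dump format):

```python
# What element["claims"]["P31"] typically looks like for Q42 in the dump:
# an ARRAY of statement objects, not a direct link to Q5.
claim_p31 = [
    {
        "mainsnak": {
            "snaktype": "value",
            "property": "P31",
            "datavalue": {
                "value": {"entity-type": "item", "id": "Q5"},
                "type": "wikibase-entityid",
            },
        },
        "type": "statement",
        "rank": "normal",
    }
]

# Extract the target entity ids from the statement array.
target_ids = [st["mainsnak"]["datavalue"]["value"]["id"] for st in claim_p31]
print(target_ids)  # ['Q5']
```

So getting an array back is the faithful reading of the file; Q5 is nested a few levels deeper inside each statement.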
Exception(u'Unexpected bytes!')
is raised while parsing a valid JSON file.
f = open('test.json', 'rb')
bj = bigjson.load(f, 'utf-8')
for obj in bj:
    print(obj.to_python())
The sample.json schema looks like this:
[{"title": "some title", "body": "some text"}, ...]
When I am trying to get an object from a file, I'm getting an error.
In [40]: with open('sample.json', 'rb') as f:
...: bigf = bigjson.load(f)
...:
...:
In [41]: item = bigf[500000]
Error trace:
ValueError Traceback (most recent call last)
<ipython-input-41-0000354c06e2> in <module>
----> 1 item = bigf[500000]
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/array.py in __getitem__(self, index)
107 return self.reader.read(read_all=False)
108 else:
--> 109 self.reader.read(read_all=True)
110
111 # Skip comma and whitespace around it
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/filereader.py in read(self, read_all, to_python)
130 return obj.to_python()
131 else:
--> 132 return Object(self, read_all)
133
134 c = self._peek()
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/obj.py in __init__(self, reader, read_all)
22 return
23
---> 24 self._read_all()
25
26 def keys(self):
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/obj.py in _read_all(self, to_python)
120 python_dict[key] = self.reader.read(read_all=True, to_python=True)
121 else:
--> 122 self.reader.read(read_all=True)
123
124 self.length += 1
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/filereader.py in read(self, read_all, to_python)
82
83 while True:
---> 84 c = self._get()
85
86 if c == b'"':
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/filereader.py in _get(self)
148
149 def _get(self):
--> 150 self._ensure_readbuf_left(1)
151 if len(self.readbuf) - self.readbuf_read < 1:
152 raise Exception(u'Unexpected end of file when getting next byte!')
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/filereader.py in _ensure_readbuf_left(self, minimum_left)
193 read_amount = max(minimum_left, FileReader._READBUF_CHUNK_SIZE) - (len(self.readbuf) - self.readbuf_read)
194 self.readbuf_pos += self.readbuf_read
--> 195 old_pos = self.file.tell()
196 self.readbuf = self.readbuf[self.readbuf_read:] + self.file.read(read_amount)
197 self.readbuf_read = 0
ValueError: I/O operation on closed file
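The last line of the traceback is the real clue: bigjson reads lazily, so the object returned by load() keeps pulling bytes from the file on every index access. In the session above, bigf[500000] runs after the with block has already closed the file. A stdlib-only sketch reproducing the same error:

```python
import io

# Simulate bigjson's lazy access: the loaded object keeps a reference to
# the underlying file and reads from it on demand.
f = io.StringIO('[{"title": "some title"}]')
f.close()
try:
    f.read()
except ValueError as e:
    print(e)  # I/O operation on closed file -- the error seen above

# Hypothetical fix: do all indexing while the file is still open.
# with open('sample.json', 'rb') as f:
#     bigf = bigjson.load(f)
#     item = bigf[500000]  # file still open here
```

Keeping every access to the bigjson object inside the with block (or not using a with block at all) avoids the ValueError.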
I have the same problem as in #4. How do I make bigjson run on Python 3?
https://github.com/henu/bigjson/blame/master/README.md#L16
Disclaimer: I am a Python newbie and sporadic user.
I was trying to use the library as shown, and I kept getting:
C:\Temp>python parseAttempt.py
Traceback (most recent call last):
File "parseAttempt.py", line 23, in <module>
inputJSON = bigjson.load(inputFile)
AttributeError: 'module' object has no attribute 'load'
The problem was solved by: from bigjson import bigjson
Should that be reflected in the README?
Hey, I'm trying to handle a 14 GB JSON file. bigjson is great for handling this dataset, but I want to work with NumPy and Pandas. Is there any way to convert a bigjson.array.Array to a pd.DataFrame or an np.array?
Hi,
thanks for your lib.
I got an error when I call bigjson.load(file): "read of closed file". Any tips for that? Thanks.
I have a file created by a third-party user who consistently writes some values as 100. (a number with a trailing decimal point) or the like. Why does bigjson refuse to interpret this as a number? You even have a test asserting that bigjson rejects:
{ "field": 100. }
Why not just assume this is float(100.)?
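The JSON grammar requires at least one digit after the decimal point, so 100. is invalid JSON and bigjson's strictness matches the spec (the built-in json module rejects it too). One workaround is a preprocessing pass over the text before parsing; fix_trailing_dots below is a hypothetical helper, not part of bigjson:

```python
import re
import json

def fix_trailing_dots(text):
    # Rewrite a bare trailing decimal point ("100.") to "100.0" when it is
    # followed by whitespace, a comma, or a closing bracket/brace.
    # Numbers like 1.5 are left untouched (the lookahead does not match a digit).
    return re.sub(r'(\d+)\.(?=[\s,\]\}])', r'\1.0', text)

print(json.loads(fix_trailing_dots('{ "field": 100. }')))  # {'field': 100.0}
```

The regex is intentionally conservative; it would miss a trailing-dot number at the very end of the text with no delimiter after it.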
I am trying to parse a large json file and preprocess data in it.
f = open("biencoder-nq-train.json", "rb")
train_data = bigjson.load(f)
new_train_data = []
for c, val in enumerate(train_data):
    temp_dict = {}
    temp_dict["answer"] = val['answers'][0]
    temp_dict["question"] = val['question']
    new_train_data.append(temp_dict)
The for loop line is throwing a UnicodeEncodeError:
107 elif c == b'u':
108 unicode_bytes = self._read(4)
--> 109 string += (b'\\u' + unicode_bytes).decode('unicode_escape').encode(self.encoding)
110 else:
111 raise Exception(u'Unexpected \\{} in backslash encoding! Position {}'.format(c.decode('utf-8'), self.readbuf_read - 1))
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud834' in position 0: surrogates not allowed
How do I resolve this?
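\ud834 is the high half of a UTF-16 surrogate pair, and the traceback suggests each \uXXXX escape is decoded on its own, which fails even when the pair is complete in the file. One workaround is to preprocess the text so paired surrogate escapes are combined into the real character before the parser sees them; combine_surrogate_pairs is a hypothetical helper, not part of bigjson:

```python
import re

# A high surrogate escape (\uD800-\uDBFF) immediately followed by a low one
# (\uDC00-\uDFFF).
SURROGATE_PAIR = re.compile(
    r'\\u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2}')

def combine_surrogate_pairs(text):
    def repl(m):
        # Decode the two escapes into surrogate code points, then let the
        # UTF-16 codec pair them into the real character (surrogatepass
        # permits the intermediate lone surrogates).
        halves = m.group(0).encode('ascii').decode('unicode_escape')
        return halves.encode('utf-16', 'surrogatepass').decode('utf-16')
    return SURROGATE_PAIR.sub(repl, text)

print(combine_surrogate_pairs(r'{"note": "\ud834\udd1e"}'))  # {"note": "𝄞"}
```

Truly lone surrogates (a high half with no low half, or vice versa) would still need to be dropped or replaced separately, since they have no valid UTF-8 form.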
Hi,
I tried to use it on a huge corrupted file, but got an error:
raise Exception(u'Unexpected bytes!')
I think it would be more convenient to print the faulty line number in such a case.
I have this JSON file content, but it fails to load the UTF-8 string "金牌综艺":
[{
"name": "金牌综艺",
"logo": "https://parco-zh.github.io/demo/new3.png",
"url": "http://121.31.30.91:8081/ysten-business/live/saishijx/1.m3u8",
"category": null,
"languages": [
{
"code": "zho",
"name": "Chinese"
}
],
"countries": [
{
"code": "cn",
"name": "China"
}
],
"tvg": {
"id": "JinPaiZongYi.cn",
"name": null,
"url": null
}
}]
Hi, when I run
with open(file, mode='rb') as jsonfile:
    jsonData = bigjson.load(jsonfile)
I get the following error:
UnicodeDecodeError Traceback (most recent call last)
in ()
4
5 with open(file, mode='rb') as jsonfile:
----> 6 jsonData = bigjson.load(jsonfile)
3 frames
/usr/local/lib/python3.6/dist-packages/bigjson/__init__.py in load(file)
4 def load(file):
5 reader = FileReader(file)
----> 6 return reader.read()
/usr/local/lib/python3.6/dist-packages/bigjson/filereader.py in read(self, read_all, to_python)
20 assert read_all or not to_python
21
---> 22 self._skip_whitespace()
23
24 # None
/usr/local/lib/python3.6/dist-packages/bigjson/filereader.py in _skip_whitespace(self)
134 def _skip_whitespace(self):
135 while True:
--> 136 self._ensure_readbuf_left(1)
137 if len(self.readbuf) - self.readbuf_read < 1:
138 break
/usr/local/lib/python3.6/dist-packages/bigjson/filereader.py in _ensure_readbuf_left(self, minimum_left)
188 read_amount = max(minimum_left, FileReader._READBUF_CHUNK_SIZE) - (len(self.readbuf) - self.readbuf_read)
189 self.readbuf_pos += self.readbuf_read
--> 190 self.readbuf = self.readbuf[self.readbuf_read:] + self.file.read(read_amount).decode('ascii')
191 self.readbuf_read = 0
192
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5693: ordinal not in range(128)
I'm trying to parse a big JSON file and getting the error "Exception: Unexpected bytes!"
A similar JSON structure and code are in the attachment.
In my test script it is raised at line 5 of my code ( print(json_obj['nested_list1']) ).
The built-in json library parses this file structure successfully.
Could you please help me with it?
bigjson_test.zip
Code String class.
When I run element = j[4], it returns 'TypeError: Key must be string!'.
Then I checked j itself; it is '<bigjson.obj.Object instance at 0x11bee2638>'.
May I have a look at your data, please?
Hi henu, nice work on reading big JSON files. I am working with the same wiki data as your example.
import bigjson
with open('wikidata-latest-all.json', 'rb') as f:
    j = bigjson.load(f)
    element = j[4]
But when the index of j becomes larger (2000+), the error mentioned above occurs.
I tried to change the original code in filereader.py line 119 from
string = string.decode('unicode_escape').encode('utf-16', 'surrogatepass').decode('utf-16')
to
string = string.decode('utf-8')
and it worked.
So every time I want to read new data, I need to change the encoding manually. Is there a quick way to add the parameter to the load() function?
Hello,
While reading a JSON file, I want to get the whole object for a given loop as a Python dictionary. At the moment it is in bigjson.obj.Object form. Is there a method to convert it into something I can use? Since I don't know what the keys in that object could be, I can't generate a dictionary using some logic or loop.
Thanks in advance; I am enjoying the package so far.
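The tracebacks elsewhere on this page show that bigjson Objects expose a to_python() method, which materializes the whole subtree as plain dicts and lists without you needing to know the keys in advance. A small hedged helper (to_plain is hypothetical, not part of bigjson) that also tolerates already-plain values:

```python
def to_plain(value):
    # bigjson Objects and Arrays expose to_python() (visible in the
    # library's own tracebacks); plain values pass through unchanged.
    return value.to_python() if hasattr(value, "to_python") else value

# Usage sketch:
#   with open('data.json', 'rb') as f:
#       obj = bigjson.load(f)
#       d = to_plain(obj)   # ordinary dict; iterate keys freely
print(to_plain({"a": [1, 2]}))  # {'a': [1, 2]}
```

Note that to_python() reads the full object into memory, so it defeats the streaming benefit for very large subtrees; use it on individual elements rather than the whole file.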
I'm loading a json file to a bigjson object, and iterating through items. Consider the following example:
with open('fileA.json', 'rb') as f:
    dictA = bigjson.load(f)
    for k, v in dictA.iteritems():
        list_v = np.array(v)
        for v1 in list_v:
            item = dictA['v1']  ### this line takes endless time without execution
The last line never finishes executing.
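Two things combine here: dictA['v1'] looks up the literal string key 'v1' rather than the loop variable, and bigjson key lookups re-scan the file each time, so repeating them inside a nested loop is extremely slow. The value the iteration yields is already the element you want. A sketch of the intended pattern on a plain dict standing in for the loaded object:

```python
# Hypothetical fix: use the value the iteration already produced instead
# of re-indexing the container with a string key.
data = {"k1": ["a", "b"], "k2": ["c"]}  # stands in for the loaded bigjson Object
items = []
for k, v in data.items():               # bigjson side: dictA.iteritems()
    for v1 in v:
        items.append(v1)                # v1 IS the element; no dictA['v1'] lookup
print(items)  # ['a', 'b', 'c']
```

If a key lookup on dictA is genuinely needed, it should at least use the variable (dictA[v1], no quotes), and ideally be done once outside the inner loop.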
Hi, first of all, thanks for providing this tool. I found it via a series of SO posts.
However, I cannot use it in my code. I get ModuleNotFoundError: No module named 'filereader'
when I try to import it. The whole IPython session looks like:
$ :~ ipython3
Python 3.6.3 (default, Oct 6 2017, 00:00:00)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import bigjson
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-dc600f1cd09e> in <module>()
----> 1 import bigjson
/usr/local/lib/python3.6/dist-packages/bigjson-1.0-py3.6.egg/bigjson/__init__.py in <module>()
----> 1 from filereader import FileReader
2
3
4 def load(file):
5 reader = FileReader(file)
ModuleNotFoundError: No module named 'filereader'
I installed this module with:
$:~ cd bigjson
$:~/bigjson sudo python3.6 setup.py install
Is there anything wrong with my installation?
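The installation is fine; the traceback points at an import style that Python 3 no longer accepts. In Python 2, from filereader import FileReader inside a package resolved implicitly as a relative import; Python 3 requires the explicit dot form. A sketch of the one-line patch (assuming you edit the package's __init__.py, here the installed copy under dist-packages):

```python
# bigjson/__init__.py
# old (Python 2 only): from filereader import FileReader
from .filereader import FileReader
```

After patching, re-run the install step (or edit the installed egg directly) so the fixed file is the one Python imports.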
I am trying to read a file which contains data with NaN values, e.g. {"a": NaN}. I know this is not valid JSON; however, I would like to know if there is a workaround that lets me read this file anyway without getting this error:
Exception: Unexpected bytes! Value 'N' Position 170
Should I just replace all occurrences of NaN with null or "NaN" first, or is there a better way to handle this?
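Replacing NaN with null before parsing is a reasonable workaround, since strict JSON has no NaN literal. A naive hedged pass (nan_to_null is a hypothetical helper; note it would also touch a literal "NaN" inside string values, so only use it on data where that cannot occur):

```python
import re
import json

def nan_to_null(text):
    # Replace bare NaN tokens with null. \b word boundaries keep it from
    # matching inside longer identifiers, but NOT from matching inside
    # quoted string values -- acceptable only if "NaN" never appears there.
    return re.sub(r'\bNaN\b', 'null', text)

print(json.loads(nan_to_null('{"a": NaN}')))  # {'a': None}
```

For very large files, apply the same substitution line by line while streaming to a temporary file instead of loading the whole text into memory.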