henu / bigjson
Python library that reads JSON files of any size.
License: MIT License
Thanks a lot for your code.
Hi, I'm using your package to load the huge Wikidata file "latest-all.json", which is about 1 TB. The loading part works fine. However, the structure we get seems to differ from the original Wikidata structure. For instance, after getting the element Q42 with "element = j[4]", the code "element["claims"]["P31"]" gives me an array of Wikidata entities, whereas I expected an object for Q5 (Q42 is supposed to be connected to Q5 via P31).
I'm wondering whether the reading method follows the original JSON file structure exactly?
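This is most likely the Wikidata dump format itself rather than a bigjson problem: in the dump, each property under "claims" maps to an array of statements, and the target QID sits under mainsnak → datavalue → value → id. A minimal sketch of that layout (field values here are illustrative, modeled on the published dump format):

```python
# What element["claims"]["P31"] typically looks like for Q42 in the dump:
# an ARRAY of statement objects, not a direct link to Q5.
claim_p31 = [
    {
        "mainsnak": {
            "snaktype": "value",
            "property": "P31",
            "datavalue": {
                "value": {"entity-type": "item", "id": "Q5"},
                "type": "wikibase-entityid",
            },
        },
        "type": "statement",
        "rank": "normal",
    }
]

# Extract the target entity ids from the statement array.
target_ids = [st["mainsnak"]["datavalue"]["value"]["id"] for st in claim_p31]
print(target_ids)  # ['Q5']
```

So getting an array back is the faithful reading of the file; Q5 is nested a few levels deeper inside each statement.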
Exception(u'Unexpected bytes!')
is raised while parsing a valid JSON file.
f = open('test.json', 'rb')
bj = bigjson.load(f, 'utf-8')
for obj in bj:
    print(obj.to_python())
The sample.json schema looks like this:
[{"title": "some title", "body": "some text"}, ...]
When I am trying to get an object from a file, I'm getting an error.
In [40]: with open('sample.json', 'rb') as f:
...: bigf = bigjson.load(f)
...:
...:
In [41]: item = bigf[500000]
Error trace:
ValueError Traceback (most recent call last)
<ipython-input-41-0000354c06e2> in <module>
----> 1 item = bigf[500000]
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/array.py in __getitem__(self, index)
107 return self.reader.read(read_all=False)
108 else:
--> 109 self.reader.read(read_all=True)
110
111 # Skip comma and whitespace around it
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/filereader.py in read(self, read_all, to_python)
130 return obj.to_python()
131 else:
--> 132 return Object(self, read_all)
133
134 c = self._peek()
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/obj.py in __init__(self, reader, read_all)
22 return
23
---> 24 self._read_all()
25
26 def keys(self):
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/obj.py in _read_all(self, to_python)
120 python_dict[key] = self.reader.read(read_all=True, to_python=True)
121 else:
--> 122 self.reader.read(read_all=True)
123
124 self.length += 1
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/filereader.py in read(self, read_all, to_python)
82
83 while True:
---> 84 c = self._get()
85
86 if c == b'"':
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/filereader.py in _get(self)
148
149 def _get(self):
--> 150 self._ensure_readbuf_left(1)
151 if len(self.readbuf) - self.readbuf_read < 1:
152 raise Exception(u'Unexpected end of file when getting next byte!')
~/Documents/playground/.env/lib/python3.8/site-packages/bigjson/filereader.py in _ensure_readbuf_left(self, minimum_left)
193 read_amount = max(minimum_left, FileReader._READBUF_CHUNK_SIZE) - (len(self.readbuf) - self.readbuf_read)
194 self.readbuf_pos += self.readbuf_read
--> 195 old_pos = self.file.tell()
196 self.readbuf = self.readbuf[self.readbuf_read:] + self.file.read(read_amount)
197 self.readbuf_read = 0
ValueError: I/O operation on closed file
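The last line of the traceback is the real clue: bigjson reads lazily, so the object returned by load() keeps pulling bytes from the file on every index access. In the session above, bigf[500000] runs after the with block has already closed the file. A stdlib-only sketch reproducing the same error:

```python
import io

# Simulate bigjson's lazy access: the loaded object keeps a reference to
# the underlying file and reads from it on demand.
f = io.StringIO('[{"title": "some title"}]')
f.close()
try:
    f.read()
except ValueError as e:
    print(e)  # I/O operation on closed file -- the error seen above

# Hypothetical fix: do all indexing while the file is still open.
# with open('sample.json', 'rb') as f:
#     bigf = bigjson.load(f)
#     item = bigf[500000]  # file still open here
```

Keeping every access to the bigjson object inside the with block (or not using a with block at all) avoids the ValueError.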
I have the same problem as in #4. How do I make bigjson run on Python 3?
https://github.com/henu/bigjson/blame/master/README.md#L16
Disclaimer: I am a Python newbie and sporadic user.
I was trying to use the library as shown, and I kept getting:
C:\Temp>python parseAttempt.py
Traceback (most recent call last):
File "parseAttempt.py", line 23, in <module>
inputJSON = bigjson.load(inputFile)
AttributeError: 'module' object has no attribute 'load'
The problem was solved by: from bigjson import bigjson
Should that be reflected in the README?
Hey, I'm trying to handle a 14 GB JSON file. bigjson is great for handling this dataset, but I want to work with NumPy and Pandas. Is there any way to convert a bigjson.array.Array to a pd.DataFrame or an np.array?
Hi,
thanks for your lib.
I got an error when I call bigjson.load(file): "read of closed file". Any tips for that? Thanks.
I have a file created by a third-party user who consistently writes some values as 100. (a number with a trailing decimal point) or the like. Why does bigjson refuse to interpret this as a number? You even have a test asserting that bigjson rejects:
{ "field": 100. }
Why not just assume this is float(100.)?
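The JSON grammar requires at least one digit after the decimal point, so 100. is invalid JSON and bigjson's strictness matches the spec (the built-in json module rejects it too). One workaround is a preprocessing pass over the text before parsing; fix_trailing_dots below is a hypothetical helper, not part of bigjson:

```python
import re
import json

def fix_trailing_dots(text):
    # Rewrite a bare trailing decimal point ("100.") to "100.0" when it is
    # followed by whitespace, a comma, or a closing bracket/brace.
    # Numbers like 1.5 are left untouched (the lookahead does not match a digit).
    return re.sub(r'(\d+)\.(?=[\s,\]\}])', r'\1.0', text)

print(json.loads(fix_trailing_dots('{ "field": 100. }')))  # {'field': 100.0}
```

The regex is intentionally conservative; it would miss a trailing-dot number at the very end of the text with no delimiter after it.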
I am trying to parse a large json file and preprocess data in it.
f = open("biencoder-nq-train.json", "rb")
train_data = bigjson.load(f)
new_train_data = []
for c, val in enumerate(train_data):
    temp_dict = {}
    temp_dict["answer"] = val['answers'][0]
    temp_dict["question"] = val['question']
    new_train_data.append(temp_dict)
The for loop line is throwing a UnicodeEncodeError:
107 elif c == b'u':
108 unicode_bytes = self._read(4)
--> 109 string += (b'\\u' + unicode_bytes).decode('unicode_escape').encode(self.encoding)
110 else:
111 raise Exception(u'Unexpected \\{} in backslash encoding! Position {}'.format(c.decode('utf-8'), self.readbuf_read - 1))
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud834' in position 0: surrogates not allowed
How do I resolve this?
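\ud834 is the high half of a UTF-16 surrogate pair, and the traceback suggests each \uXXXX escape is decoded on its own, which fails even when the pair is complete in the file. One workaround is to preprocess the text so paired surrogate escapes are combined into the real character before the parser sees them; combine_surrogate_pairs is a hypothetical helper, not part of bigjson:

```python
import re

# A high surrogate escape (\uD800-\uDBFF) immediately followed by a low one
# (\uDC00-\uDFFF).
SURROGATE_PAIR = re.compile(
    r'\\u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2}')

def combine_surrogate_pairs(text):
    def repl(m):
        # Decode the two escapes into surrogate code points, then let the
        # UTF-16 codec pair them into the real character (surrogatepass
        # permits the intermediate lone surrogates).
        halves = m.group(0).encode('ascii').decode('unicode_escape')
        return halves.encode('utf-16', 'surrogatepass').decode('utf-16')
    return SURROGATE_PAIR.sub(repl, text)

print(combine_surrogate_pairs(r'{"note": "\ud834\udd1e"}'))  # {"note": "𝄞"}
```

Truly lone surrogates (a high half with no low half, or vice versa) would still need to be dropped or replaced separately, since they have no valid UTF-8 form.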
Hi,
I tried to use it on a huge corrupted file, but got an error:
raise Exception(u'Unexpected bytes!')
I think it would be more convenient to print the faulty line number in such a case.
I have this JSON file content, but it fails to load the UTF-8 string "金牌综艺":
[{
"name": "金牌综艺",
"logo": "https://parco-zh.github.io/demo/new3.png",
"url": "http://121.31.30.91:8081/ysten-business/live/saishijx/1.m3u8",
"category": null,
"languages": [
{
"code": "zho",
"name": "Chinese"
}
],
"countries": [
{
"code": "cn",
"name": "China"
}
],
"tvg": {
"id": "JinPaiZongYi.cn",
"name": null,
"url": null
}
}]
Hi, when I run
with open(file, mode='rb') as jsonfile:
    jsonData = bigjson.load(jsonfile)
I get the following error:
UnicodeDecodeError Traceback (most recent call last)
in ()
4
5 with open(file, mode='rb') as jsonfile:
----> 6 jsonData = bigjson.load(jsonfile)
3 frames
/usr/local/lib/python3.6/dist-packages/bigjson/__init__.py in load(file)
4 def load(file):
5 reader = FileReader(file)
----> 6 return reader.read()
/usr/local/lib/python3.6/dist-packages/bigjson/filereader.py in read(self, read_all, to_python)
20 assert read_all or not to_python
21
---> 22 self._skip_whitespace()
23
24 # None
/usr/local/lib/python3.6/dist-packages/bigjson/filereader.py in _skip_whitespace(self)
134 def _skip_whitespace(self):
135 while True:
--> 136 self._ensure_readbuf_left(1)
137 if len(self.readbuf) - self.readbuf_read < 1:
138 break
/usr/local/lib/python3.6/dist-packages/bigjson/filereader.py in _ensure_readbuf_left(self, minimum_left)
188 read_amount = max(minimum_left, FileReader._READBUF_CHUNK_SIZE) - (len(self.readbuf) - self.readbuf_read)
189 self.readbuf_pos += self.readbuf_read
--> 190 self.readbuf = self.readbuf[self.readbuf_read:] + self.file.read(read_amount).decode('ascii')
191 self.readbuf_read = 0
192
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5693: ordinal not in range(128)
I'm trying to parse a big JSON file and getting the error "Exception: Unexpected bytes!"
A similar JSON structure and code are in the attachment.
In my test script it is raised at line 5 of my code ( print(json_obj['nested_list1']) ).
The built-in json library parses this file structure successfully.
Could you please help me with it?
bigjson_test.zip
Code String class.
When I run element = j[4], it returns 'TypeError: Key must be string!'.
Then I checked j itself; it is '<bigjson.obj.Object instance at 0x11bee2638>'.
May I have a look at your data, please?
Hi henu, nice work on reading big JSON files. I am working with the same wiki data as your example.
import bigjson
with open('wikidata-latest-all.json', 'rb') as f:
    j = bigjson.load(f)
    element = j[4]
But when the index of j becomes larger (2000+), the error mentioned above occurs.
I tried to change the original code in filereader.py line 119 from
string = string.decode('unicode_escape').encode('utf-16', 'surrogatepass').decode('utf-16')
to
string = string.decode('utf-8')
and it worked.
So every time I want to read new data, I need to change the encoding manually. Is there a quick way to add the parameter to the load() function?
Hello,
While reading a JSON file, I want to get the whole object for a given loop as a Python dictionary. At the moment it is in bigjson.obj.Object form. Is there a method to convert it into something I can use? Since I don't know what the keys in that object could be, I can't generate a dictionary using some logic or loop.
Thanks in advance; I am enjoying the package so far.
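The tracebacks elsewhere on this page show that bigjson Objects expose a to_python() method, which materializes the whole subtree as plain dicts and lists without you needing to know the keys in advance. A small hedged helper (to_plain is hypothetical, not part of bigjson) that also tolerates already-plain values:

```python
def to_plain(value):
    # bigjson Objects and Arrays expose to_python() (visible in the
    # library's own tracebacks); plain values pass through unchanged.
    return value.to_python() if hasattr(value, "to_python") else value

# Usage sketch:
#   with open('data.json', 'rb') as f:
#       obj = bigjson.load(f)
#       d = to_plain(obj)   # ordinary dict; iterate keys freely
print(to_plain({"a": [1, 2]}))  # {'a': [1, 2]}
```

Note that to_python() reads the full object into memory, so it defeats the streaming benefit for very large subtrees; use it on individual elements rather than the whole file.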
I'm loading a json file to a bigjson object, and iterating through items. Consider the following example:
with open('fileA.json', 'rb') as f:
    dictA = bigjson.load(f)
    for k, v in dictA.iteritems():
        list_v = np.array(v)
        for v1 in list_v:
            item = dictA['v1']  ### this line takes endless time without execution
The last line never finishes executing.
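Two things combine here: dictA['v1'] looks up the literal string key 'v1' rather than the loop variable, and bigjson key lookups re-scan the file each time, so repeating them inside a nested loop is extremely slow. The value the iteration yields is already the element you want. A sketch of the intended pattern on a plain dict standing in for the loaded object:

```python
# Hypothetical fix: use the value the iteration already produced instead
# of re-indexing the container with a string key.
data = {"k1": ["a", "b"], "k2": ["c"]}  # stands in for the loaded bigjson Object
items = []
for k, v in data.items():               # bigjson side: dictA.iteritems()
    for v1 in v:
        items.append(v1)                # v1 IS the element; no dictA['v1'] lookup
print(items)  # ['a', 'b', 'c']
```

If a key lookup on dictA is genuinely needed, it should at least use the variable (dictA[v1], no quotes), and ideally be done once outside the inner loop.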
Hi, first of all, thanks for providing this tool. I found it via a series of SO posts.
However, I cannot use it in my code. I get ModuleNotFoundError: No module named 'filereader'
when I try to import it. The whole IPython session looks like:
$ :~ ipython3
Python 3.6.3 (default, Oct 6 2017, 00:00:00)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import bigjson
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-dc600f1cd09e> in <module>()
----> 1 import bigjson
/usr/local/lib/python3.6/dist-packages/bigjson-1.0-py3.6.egg/bigjson/__init__.py in <module>()
----> 1 from filereader import FileReader
2
3
4 def load(file):
5 reader = FileReader(file)
ModuleNotFoundError: No module named 'filereader'
I installed this module with:
$:~ cd bigjson
$:~/bigjson sudo python3.6 setup.py install
Is there anything wrong with my installation?
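The installation is fine; the traceback points at an import style that Python 3 no longer accepts. In Python 2, from filereader import FileReader inside a package resolved implicitly as a relative import; Python 3 requires the explicit dot form. A sketch of the one-line patch (assuming you edit the package's __init__.py, here the installed copy under dist-packages):

```python
# bigjson/__init__.py
# old (Python 2 only): from filereader import FileReader
from .filereader import FileReader
```

After patching, re-run the install step (or edit the installed egg directly) so the fixed file is the one Python imports.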
I am trying to read a file which contains data with NaN values, e.g. {"a": NaN}. I know this is not valid JSON; however, I would like to know if there is a workaround that lets me read this file anyway without getting this error:
Exception: Unexpected bytes! Value 'N' Position 170
Should I just replace all occurrences of NaN with null or "NaN" first, or is there a better way to handle this?
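Replacing NaN with null before parsing is a reasonable workaround, since strict JSON has no NaN literal. A naive hedged pass (nan_to_null is a hypothetical helper; note it would also touch a literal "NaN" inside string values, so only use it on data where that cannot occur):

```python
import re
import json

def nan_to_null(text):
    # Replace bare NaN tokens with null. \b word boundaries keep it from
    # matching inside longer identifiers, but NOT from matching inside
    # quoted string values -- acceptable only if "NaN" never appears there.
    return re.sub(r'\bNaN\b', 'null', text)

print(json.loads(nan_to_null('{"a": NaN}')))  # {'a': None}
```

For very large files, apply the same substitution line by line while streaming to a temporary file instead of loading the whole text into memory.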