Question

I have a large JSON file (~5 GB), but it does not consist of a single JSON document; it is several JSON objects concatenated together.

{"created_at":"Mon Jan 13 20:01:57 +0000 2014","id":422820833807970304,"id_str":"422820833807970304"}
{"created_at":"Mon Jan 13 20:01:57 +0000     2014","id":422820837545500672,"id_str":"422820837545500672"}.....

There are no newlines between the closing and opening braces (}{).

I tried using sed to insert newlines between the curly braces and then reading the file with:

data=[]
for line in open(filename,'r').readline():
    data.append(json.loads(line))

But this does not work.

How can I read this file relatively quickly?

Any help is much appreciated.

Answer 1

This is a hack. It does not load the whole file into memory. I really hope you are using Python 3.

DecodeLargeJSON.py

from DecodeLargeJSON import *
import io
import json

# create a file with two jsons
f = io.StringIO()
json.dump({1:[]}, f)
json.dump({2:"hallo"}, f)
print(repr(f.getvalue()))
f.seek(0) 

# decode the file f. f could be any file from here on. f.read(...) should return str
o1, idx1 = json.loads(FileString(f), cls = BigJSONDecoder)
print(o1) # this is the loaded object
# idx1 is the index that the second object begins with
o2, idx2 = json.loads(FileString(f, idx1), cls = BigJSONDecoder)
print(o2)

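If you find objects that cannot be decoded, tell me and we can work out a solution.

Disclaimer: This is neither an efficient nor the best solution. It is a hack that shows how it can be done.

Discussion: Because it does not load the whole file into memory, regular expressions cannot be used on it. It also uses the pure-Python implementation of the decoder rather than the C implementation, which may make it slow. I really hate that such a simple task is this hard. Hopefully someone else can point out a different solution.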
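For comparison, here is a minimal sketch of one different approach that stays inside the standard library: read the file in chunks and peel off one object at a time with json.JSONDecoder.raw_decode, which returns the decoded value together with the index where it stopped. This is not the DecodeLargeJSON code above. The function name iter_concatenated_json and the chunk_size parameter are made up for illustration, and the sketch assumes every top-level value is an object or array (as in the question), since a bare number split across two chunks would be misread.

```python
import io
import json


def iter_concatenated_json(fp, chunk_size=65536):
    """Yield each top-level JSON value from a text stream of concatenated documents.

    Hypothetical helper: assumes top-level values are objects or arrays.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fp.read(chunk_size)
        buffer += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos >= len(buffer):
                break
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                if not chunk:
                    # End of input reached: the leftover text is really malformed.
                    raise
                # Most likely an object cut off by the chunk boundary: read more.
                break
            yield obj
        # Keep only the part that has not been decoded yet.
        buffer = buffer[pos:]
        if not chunk:
            break


# Small demonstration on an in-memory stream instead of a 5 GB file.
demo = io.StringIO('{"id": 1}{"id": 2} {"id": 3, "tags": ["a", "b"]}')
for obj in iter_concatenated_json(demo, chunk_size=7):
    print(obj)
```

With a real file you would pass open(filename, 'r') instead of the StringIO object and process each object as it is yielded, rather than appending all of them to a list. Memory use then depends on the chunk size and the largest single object, not on the size of the file. If one object is much larger than the chunk, it gets re-parsed on every read until it is complete, so a bigger chunk_size may help in that case.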
