Question

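I am trying to read a file of JSON data (3.1M+ records). I want to compare the memory and time efficiency of reading the whole file at once versus reading it line by line.

File1 is serialized JSON data: a single list containing 3.1M+ dictionaries, 811M in size.

File2 is serialized JSON data with one dictionary per line. It has 3.1M+ lines in total and is 480M in size.

Profile information when reading file1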

(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json 
3108779
Filename: read_wholefile.json

Line #    Mem usage    Increment   Line Contents
================================================
 5      9.4 MiB      0.0 MiB   @profile
 6                             def read_file():
 7      9.4 MiB      0.0 MiB     f = open("File1.json")
 8   3725.3 MiB   3715.9 MiB     f_json  = json.loads(f.read())
 9   3725.3 MiB      0.0 MiB     print len(f_json)


     23805 function calls (22916 primitive calls) in 30.230 seconds

Profile information when reading file2

(flask)chitturiLaptop:data kiran$ python -m cProfile read_line_by_line.json 
3108779
Filename: read_line_by_line.json

 Line #    Mem usage    Increment   Line Contents
 ================================================
 4      9.4 MiB      0.0 MiB   @profile
 5                             def read_file():
 6      9.4 MiB      0.0 MiB     data_json = []
 7      9.4 MiB      0.0 MiB     with open("File2.json") as f:
 8   3726.2 MiB   3716.8 MiB       for line in f:
 9   3726.2 MiB      0.0 MiB         data_json.append(json.loads(line))
10   3726.2 MiB      0.0 MiB     print len(data_json)


     28002875 function calls (28001986 primitive calls) in 244.282 seconds

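According to this SO post, shouldn't iterating over file2 take less memory? Reading the whole file and loading it through JSON also took less time.

I am running Python 2.7.2 on Mac OS X 10.8.5.

EDIT

Profile information for json.load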

(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json 
3108779
Filename: read_wholefile.json

Line #    Mem usage    Increment   Line Contents
================================================
 5      9.4 MiB      0.0 MiB   @profile
 6                             def read_file():
 7      9.4 MiB      0.0 MiB     f = open("File1.json")
 8   3725.3 MiB   3715.9 MiB     f_json  = json.load(f)
 9   3725.3 MiB      0.0 MiB     print len(f_json)
10   3725.3 MiB      0.0 MiB     f.close()


     23820 function calls (22931 primitive calls) in 27.266 seconds

EDIT2

Some statistics to support the answer.

(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json 
3108779
Filename: read_wholefile.json

Line #    Mem usage    Increment   Line Contents
================================================
 5      9.4 MiB      0.0 MiB   @profile
 6                             def read_file():
 7      9.4 MiB      0.0 MiB     f = open("File1.json")
 8    819.9 MiB    810.6 MiB     serialized = f.read()
 9   4535.8 MiB   3715.9 MiB     deserialized  = json.loads(serialized)
10   4535.8 MiB      0.0 MiB     print len(deserialized)
11   4535.8 MiB      0.0 MiB     f.close()


     23856 function calls (22967 primitive calls) in 26.815 seconds

Answer 1

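Your first test does not show the memory consumed by reading the whole file into one giant string, because that string is discarded before the source line finishes executing, and the profiler does not report memory consumption in the middle of a line. If you save the string to a variable: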

serialized = f.read()
deserialized = json.loads(serialized)

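you will see the 811 MB of memory consumed by the temporary string. The ~3725 MB you see in both tests is mostly the deserialized data structure, which is the same in both tests.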
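To make the split between the serialized string and the deserialized structure visible without a multi-gigabyte file, here is a minimal sketch. It assumes Python 3 and the standard-library `tracemalloc` module rather than the Python 2.7 and line-profiler setup used in the question. The `measure` function and the small generated data set are hypothetical stand-ins for File1.json.

```python
import json
import tracemalloc


def measure(n_records=10000):
    """Return (string length, bytes traced after building the string,
    extra bytes traced after parsing, number of parsed records)."""
    # Hypothetical records standing in for the contents of File1.json.
    records = [{"id": i, "name": "user%d" % i} for i in range(n_records)]

    tracemalloc.start()
    # Plays the role of f.read(): the whole document as one string.
    serialized = json.dumps(records)
    string_bytes, _ = tracemalloc.get_traced_memory()

    # The string stays alive in a variable while the parsed structure is built.
    deserialized = json.loads(serialized)
    total_bytes, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return len(serialized), string_bytes, total_bytes - string_bytes, len(deserialized)


if __name__ == "__main__":
    length, string_bytes, parsed_bytes, count = measure()
    print("records:", count)
    print("serialized string: %d chars, %d bytes traced" % (length, string_bytes))
    print("added by json.loads: %d bytes" % parsed_bytes)
```

The two byte counts correspond to the two increments in the EDIT2 profile: one for holding the text, one for the Python objects built from it.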

Finally, note that json.load(f) is a faster, more concise, and friendlier way to load JSON data from a file than json.loads(f.read()) or iterating line by line.
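As for whether iterating over File2 should use less memory: it only does so if each record is dropped after use. Appending every record to a list, as the second test does, rebuilds the same large structure. The following is a minimal sketch, assuming Python 3 and assuming the goal is an aggregate (here, a count) rather than the full list. `iter_records` and `count_records` are hypothetical helper names, not part of the `json` module.

```python
import json


def iter_records(path):
    """Yield one decoded record per non-empty line of a one-object-per-line file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(path):
    # Each record becomes garbage as soon as the next one is requested,
    # so memory use stays roughly constant regardless of file size.
    return sum(1 for _ in iter_records(path))
```

Calling `count_records("File2.json")` would give the same record count as `len(data_json)` in the question, without keeping all 3.1M+ dictionaries alive at once.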
