Question

I am currently using the following code to convert a large CSV file to a JSON file.

import csv 
import json 

def csv_to_json(csvFilePath, jsonFilePath):
    jsonArray = []
      
    with open(csvFilePath, encoding='utf-8') as csvf: 
        csvReader = csv.DictReader(csvf) 

        for row in csvReader: 
            jsonArray.append(row)
    with open(jsonFilePath, 'w', encoding='utf-8') as jsonf: 
        jsonString = json.dumps(jsonArray, indent=4)
        jsonf.write(jsonString)
          
csvFilePath = r'test_data.csv'
jsonFilePath = r'test_data.json'
csv_to_json(csvFilePath, jsonFilePath)

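This code works well, and I can convert the CSV to JSON without any issues. However, the CSV file contains more than 600,000 rows, so the resulting JSON has just as many items, and the JSON file has become very difficult to manage.

I would like to modify the code above so that every 5000 rows of the CSV are written to a new JSON file. Ideally, I would end up with 120 (600,000/5000) JSON files in this case.

How can I do this?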

Answer 1

Split your reading and writing logic into separate functions and add a simple threshold:

import csv
import json

JSON_ENTRIES_THRESHOLD = 5000  # modify to whatever you see suitable

def write_json(json_array, filename):
    with open(filename, 'w', encoding='utf-8') as jsonf:
        json.dump(json_array, jsonf)  # note the usage of .dump directly to a file object

def csv_to_json(csvFilePath, jsonFilePath):
    # jsonFilePath is used as a prefix here: pass 'test_data', not 'test_data.json'
    jsonArray = []
    filename_index = 0

    with open(csvFilePath, encoding='utf-8', newline='') as csvf:
        csvReader = csv.DictReader(csvf)

        for row in csvReader:
            jsonArray.append(row)
            if len(jsonArray) >= JSON_ENTRIES_THRESHOLD:
                # if we reached the threshold, write out this chunk and start a new one
                write_json(jsonArray, f"{jsonFilePath}-{filename_index}.json")
                filename_index += 1
                jsonArray = []

    # Finally, write out the remainder (skipped when the last chunk was already full)
    if jsonArray:
        write_json(jsonArray, f"{jsonFilePath}-{filename_index}.json")
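
As an alternative sketch (not part of the original answer), the same chunking can be written with itertools.islice, which pulls at most chunk_size rows from the reader at a time, so only one chunk is held in memory. The names iter_chunks and split_csv_to_json, the output_prefix parameter, and the zero-padded file naming are assumptions made for illustration.

```python
import csv
import json
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            # The source is exhausted; no empty chunk is ever yielded.
            return
        yield chunk


def split_csv_to_json(csv_path, output_prefix, chunk_size=5000):
    """Write each chunk of CSV rows to its own JSON file; return the file names."""
    written = []
    with open(csv_path, encoding='utf-8', newline='') as csvf:
        reader = csv.DictReader(csvf)
        for index, chunk in enumerate(iter_chunks(reader, chunk_size)):
            # Hypothetical naming scheme: <prefix>-0000.json, <prefix>-0001.json, ...
            filename = f"{output_prefix}-{index:04d}.json"
            with open(filename, 'w', encoding='utf-8') as jsonf:
                json.dump(chunk, jsonf, indent=4)
            written.append(filename)
    return written
```

Calling split_csv_to_json('test_data.csv', 'test_data') would write test_data-0000.json, test_data-0001.json, and so on. With 600,000 rows and a chunk size of 5000 that comes to 120 files, and no empty trailing file is produced when the row count is an exact multiple of the chunk size.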
