python - PySpark - Avoiding the array error java.lang.OutOfMemoryError: Requested array size exceeds VM limit

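I have a legacy application that will be deprecated in a few months. It reads the contents of a file line by line into an array, and for some unexpectedly large files the application throws a java.lang.OutOfMemoryError: Requested array size exceeds VM limit exception.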

Update: This code runs as a PySpark job in AWS Glue, which is why a Java exception is reported.

I want to fix this temporarily, without putting much effort into it.

Is it a good idea to check the size (or length) of the array against a maximum limit in order to avoid the java.lang.OutOfMemoryError: Requested array size exceeds VM limit error?

# PySpark code below
log_lines = []
for line in log_file:
    log_lines.append(line)
    if len(log_lines) >= 100000:  # cap the list at 100000 lines
        process_part(log_lines)  # process the batch, i.e. write it somewhere, then continue
        log_lines = []

# process whatever is left over after the loop
if log_lines:
    process_part(log_lines)
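
The batching idea above can also be written as a small generator, so that the size check lives in one place. This is only a minimal sketch: `iter_batches` and its `batch_size` parameter are hypothetical names introduced here, and it assumes that `log_file` is an iterable of lines and that `process_part` is the function from the question.

```python
from itertools import islice


def iter_batches(lines, batch_size=100_000):
    """Yield lists holding at most batch_size items from any iterable."""
    iterator = iter(lines)
    while True:
        # islice pulls at most batch_size items without reading further ahead
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


# Stand-in for log_file: a generator of 10 lines, batched 4 at a time
sample_lines = (f"line {i}\n" for i in range(10))
batch_sizes = [len(batch) for batch in iter_batches(sample_lines, batch_size=4)]
```

In the job itself this would be used as `for batch in iter_batches(log_file): process_part(batch)`. Note that this caps the number of lines held at once, not the number of bytes, so a single extremely long line can still produce a very large object.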


0 answers: