PySpark - splitting a large text file into multiple files

Date: 2018-10-29 04:40:43

Tags: python-3.x pyspark apache-spark-sql amazon-emr

I need to split a large text file in S3, containing ~100 million records, into multiple files and save each one back to S3 as a .txt file. The records are not delimited; each column can be identified by a start and end position. The length of a record varies by "type", where "type" is a string with a fixed start/end position, and I need to split this file into multiple files based on the value of "type".

For example,

My name is Chris  age 45  
My name is Denni  age 46  
My name is Vicki  age 47  
My name is Denni  age 51  
My name is Chris  age 52

In the example above, assume my "record type" starts at position 12 and ends at position 17. In a sequence of steps,

1. I need to get a distinct list of record types, which in this case are "Chris", "Denni" and "Vicki"

2. I need to split this file into 3 files, one for each record type, and save them with the same names as the record types: Chris.txt, Denni.txt and Vicki.txt

Desired output:

Chris.txt:

My name is Chris  age 45  
My name is Chris  age 52 

Denni.txt:

My name is Denni  age 46  
My name is Denni  age 51

Vicki.txt:

My name is Vicki  age 47 

I am using PySpark dataframes to achieve this, and what I have so far is this,

df_inter = df.select(df.value.substr(start, end).alias("Type"), df.value.alias("value"))

df_types = df_inter.select("Type").distinct()
type_count = df_types.count()

i = 0
while i < type_count:
    # collect() pulls every distinct type to the driver on each iteration
    rec_type = df_types.collect()[i][0]
    df_filtered = df_inter.filter(df_inter["Type"] == rec_type)
    # DataFrames have no saveAsTextFile, so drop to the underlying RDD of record strings
    df_filtered.rdd.map(lambda row: row.value).saveAsTextFile("path")
    i += 1

The current code works, but it takes ~25 minutes to process a 2.5 GB file on a 5-node EMR cluster of r5.xlarge instances, and much longer for a 25 GB file. I would like to know if there is a more efficient way to reduce the processing time. Thanks for your input.

2 answers:

Answer 0 (score: 0):

I am assuming that your data is tab-delimited. You can load the entire data into a dataframe as shown below:

df = spark.read.format("com.databricks.spark.csv") \
  .option("mode", "DROPMALFORMED") \
  .option("header", "false") \
  .option("inferschema", "true") \
  .option("delimiter", '\t').load(PATH_TO_FILE)

+---+----+---+-----+---+---+
|_c0| _c1|_c2|  _c3|_c4|_c5|
+---+----+---+-----+---+---+
| My|name| is|Chris|age| 45|
| My|name| is|Denni|age| 46|
| My|name| is|Vicki|age| 47|
| My|name| is|Denni|age| 51|
| My|name| is|Chris|age| 52|
+---+----+---+-----+---+---+

Then you can filter the dataframe and split it into multiple dataframes based on the column value.

from pyspark.sql.functions import col

Chris_df = df.filter(col('_c3') == 'Chris')
+---+----+---+-----+---+---+
|_c0| _c1|_c2|  _c3|_c4|_c5|
+---+----+---+-----+---+---+
| My|name| is|Chris|age| 45|
| My|name| is|Chris|age| 52|
+---+----+---+-----+---+---+
Denni_df = df.filter(col('_c3') == 'Denni')
+---+----+---+-----+---+---+
|_c0| _c1|_c2|  _c3|_c4|_c5|
+---+----+---+-----+---+---+
| My|name| is|Denni|age| 46|
| My|name| is|Denni|age| 51|
+---+----+---+-----+---+---+
Vicki_df = df.filter(col('_c3') == 'Vicki')

+---+----+---+-----+---+---+
|_c0| _c1|_c2|  _c3|_c4|_c5|
+---+----+---+-----+---+---+
| My|name| is|Vicki|age| 47|
+---+----+---+-----+---+---+
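To actually produce the split files, each filtered dataframe still has to be written back to S3; a minimal sketch, assuming a hypothetical output bucket and CSV output:

Chris_df.write.mode("overwrite").csv("s3://my-bucket/output/Chris/")  # hypothetical S3 prefix
Denni_df.write.mode("overwrite").csv("s3://my-bucket/output/Denni/")
Vicki_df.write.mode("overwrite").csv("s3://my-bucket/output/Vicki/")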

I hope this runs faster!

Answer 1 (score: 0):

Here you go! You can parse the plain mainframe file at specific positions and delimit it into a csv.

import csv

PlainTextfile  = 'InputFilePathLocation\Input_File.txt'
CSV_OutputFile = 'OutputFilePathLocation\Output_File.txt'
cols = [(0,2),(3,8),(8,10),(11,17),(17,22),(22,24)]

with open(PlainTextfile,'r') as fin, open(CSV_OutputFile, 'wt') as fout:
    writer = csv.writer(fout, delimiter=",", lineterminator="\n")
    for line in fin:
        line = line.rstrip()  # removing the '\n' and other trailing whitespaces
        data = [line[c[0]:c[1]] for c in cols]
        print("data:", data)
        writer.writerow(data)

Your output file now becomes:

My,name ,is,Chris , age ,45
My,name ,is,Denni , age ,46
My,name ,is,Vicki , age ,47
My,name ,is,Denni , age ,51
My,name ,is,Chris , age ,52

You can then load this delimited csv file into a dataframe or RDD and split it into separate dataframes with filter operations, or write it to different csv files using the DataFrame writer class methods, as in the sketch below. Let me know if you need further details.
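For example, a minimal sketch of the writer-based split, assuming the default _c3 column name from the CSV load and a hypothetical S3 output prefix:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv(CSV_OutputFile)  # the delimited file produced above

# '_c3' holds the record type; partitionBy writes each type to its own subdirectory
df.write.partitionBy('_c3').mode('overwrite').csv('s3://my-bucket/split-output/')  # hypothetical bucket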
