Spark Python CSV Write

Date: 2017-05-07 01:30:26

Tags: apache-spark

I am trying to save the following file in CSV format in Spark Python. The file:

44,8602,37.19
35,5368,65.89
2,3391,40.64
47,6694,14.98
29,680,13.08

    import csv
    import StringIO

    inputa = sc.textFile("file:///home/kumar/customer-orders.csv")

    def writerecord(line):
        output = StringIO.StringIO()
        writer = csv.DictWriter(output, fieldnames=["a", "b", "c"],
                                extrasaction='ignore', delimiter=',')
        for record in line:
            writer.writerow(record)
        return [output.getvalue()]

    inputa.mapPartitions(writerecord).saveAsTextFile("file:///home/godffddf2888/resultcx12")

I am getting the following error:

Error process()
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.3.4.0-3485/spark/python/pyspark/rdd.py", line 2355, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.3.4.0-3485/spark/python/pyspark/rdd.py", line 317, in func
    return f(iterator)
  File "<stdin>", line 5, in writerecord
  File "/usr/lib64/python2.7/csv.py", line 148, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/usr/lib64/python2.7/csv.py", line 145, in _dict_to_list
    return [rowdict.get(key, self.restval) for key in self.fieldnames]
AttributeError: 'unicode' object has no attribute 'get'

3 Answers:

Answer 0 (score: 1):

I hope this code does what you want:

import csv
import io
from pyspark import SparkContext

sc = SparkContext()

inputa = sc.textFile("./customer-orders.csv")

# helper for turning a list of fields back into a single CSV line
def toCSVlines(data):
    return ','.join(str(i) for i in data)

columns_name = [["a", "b", "c"]]
header = sc.parallelize(columns_name)  # RDD containing the column names

rdd = inputa.mapPartitions(lambda x: csv.reader(x))  # parse the lines of the input .csv file
print(rdd.collect())
# [['44', '8602', '37.19'], ['35', '5368', '65.89'], ['2', '3391', '40.64'], ['47', '6694', '14.98'], ['29', '680', '13.08']]

rdd_tot = header.union(rdd) # create a single RDD with column names and values

print (rdd_tot.collect())
# [['a', 'b', 'c'], ['44', '8602', '37.19'], ['35', '5368', '65.89'], ['2', '3391', '40.64'], ['47', '6694', '14.98'], ['29', '680', '13.08']]

rdd_tot = rdd_tot.map(toCSVlines)
rdd_tot.coalesce(1).saveAsTextFile("/whereverYouWant")
# you will have something like this
# a,b,c
# 44,8602,37.19
# 35,5368,65.89
# 2,3391,40.64
# 47,6694,14.98
# 29,680,13.08
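
For completeness: the AttributeError in the question happens because csv.DictWriter.writerow() expects a dict per row, while mapPartitions() hands writerecord an iterator of raw unicode lines. A minimal sketch of the direct fix, assuming the question's Python 2 StringIO setup and fieldnames (untested against the asker's cluster):

import csv
import StringIO

def writerecord(lines):
    output = StringIO.StringIO()
    writer = csv.DictWriter(output, fieldnames=["a", "b", "c"],
                            extrasaction='ignore', delimiter=',')
    for line in lines:
        # wrap the split fields in a dict so DictWriter can look keys up
        writer.writerow(dict(zip(["a", "b", "c"], line.split(','))))
    return [output.getvalue()]

inputa.mapPartitions(writerecord).saveAsTextFile("file:///home/godffddf2888/resultcx12")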

Answer 1 (score: 0):

To read the csv file into a dataframe (and add column names):

# schema= expects a StructType (or, in newer Spark, a DDL string),
# so name the columns with toDF() instead of passing a plain list
df = spark.read.csv('my_csv_file.csv').toDF('cola', 'colb', 'colc')

You can manipulate the data in that dataframe, and once you are done you can write it back out in csv format:

df.write.csv(path='output_file.csv', mode='overwrite')
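
As an illustration of the "manipulate the data" step, here is a hypothetical transformation between the read and the write (the cast, the filter threshold, and the derived column are assumptions, not part of the original answer):

from pyspark.sql import functions as F

# cast the third column to double, keep rows above an arbitrary threshold,
# and add a derived column before writing the result back out
df_out = (df.withColumn('colc', F.col('colc').cast('double'))
            .filter(F.col('colc') > 20.0)
            .withColumn('colc_doubled', F.col('colc') * 2))
df_out.write.csv(path='output_file.csv', mode='overwrite')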

Answer 2 (score: 0):

You can try using flatMap() instead of mapPartitions().
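
A minimal sketch of what that might look like, assuming the question's input RDD; the per-line parsing and the output path are illustrative assumptions:

import csv

# flatMap runs per element rather than per partition: csv.reader([line])
# yields one list of fields, which flatMap flattens into the output RDD
fields_rdd = inputa.flatMap(lambda line: csv.reader([line]))
fields_rdd.map(lambda fields: ','.join(fields)) \
          .saveAsTextFile("file:///home/godffddf2888/resultcx12_flatmap")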
