Question

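I have a CSV file that I load into a DataFrame (df) in PySpark. After some transformations, I want to add a column to df that is a simple row id (starting from 0 or 1 and running up to N).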

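I converted df to an RDD and used "zipWithIndex", then converted the resulting RDD back to a DataFrame. This approach works, but it generates 250k tasks and takes a long time to execute. I would like to know whether there is another way to do this that takes less run time.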
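For context, here is a rough pure-Python sketch of how zipWithIndex arrives at consecutive indices. It is a simulation rather than Spark code: partitions are modeled as plain lists, and the helper name zip_with_index is made up for illustration. The point it shows is that every partition except the last has to be counted before any row can be numbered, so the data is traversed twice.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Simulate RDD.zipWithIndex on partitions modeled as plain lists (hypothetical helper)."""
    # Pass 1: count the elements in every partition except the last one.
    counts = [len(part) for part in partitions[:-1]]
    # Start offset of each partition: 0, then the running total of the counts.
    starts = [0] + list(accumulate(counts))
    # Pass 2: number the rows inside each partition, beginning at its offset.
    return [
        [(row, start + i) for i, row in enumerate(part)]
        for part, start in zip(partitions, starts)
    ]


parts = [["a", "b", "c"], ["d"], ["e", "f"]]
indexed = zip_with_index(parts)
```

In this sketch the indices run from 0 to 5 without gaps, regardless of how the six rows are split across the three lists.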

Below is a snippet of my code. The CSV file I am processing is big; it contains billions of rows.

case class ContainerBuilder(f1: Option[String] = None,
                       f2: Option[Boolean] = None,
                       f3: Option[Int] = None,
                       f4: Option[String] = None,
                       ...,
                       fieldsSet: Int = 0) {
  // return a copy of this ContainerBuilder - holding any fields already set - 
  // with f1 now set, and the count of set fields incremented: 
  def fromF1(f1: F1) = copy(f1 = Some(f1), fieldsSet = fieldsSet + 1)

  // Likewise but setting the f2 field:
  def fromF2(f2: F2) = copy(f2 = Some(f2), fieldsSet = fieldsSet + 1)

  ...

  def build = if (readyToBuild) Container(f1,f2,...) else ... // for 'unready' cases, you can throw an exception, or change build to return an Option[Container], or whatever.

  def readyToBuild = fieldsSet > 2 // Ensures at least 3 fields set - change to whatever criteria you need.
}

Answer 1

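You can also use a function from the sql package. It generates a unique id, but the ids will not be sequential, because they depend on the number of partitions. I believe it is available in Spark 1.5+.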
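To illustrate why these ids are unique but not consecutive: the Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below reproduces that layout in plain Python. The helper monotonic_id and the partition sizes are assumptions for illustration, not Spark API.

```python
def monotonic_id(partition_id, record_number):
    """Compose an id as described in the docs: partition ID in the upper bits,
    record number within the partition in the lower 33 bits (hypothetical helper)."""
    return (partition_id << 33) | record_number


# Assumed layout: partition 0 holds 2 rows, partition 1 holds 3 rows.
rows_per_partition = [2, 3]
ids = [
    monotonic_id(pid, rec)
    for pid, n_rows in enumerate(rows_per_partition)
    for rec in range(n_rows)
]
```

Under this layout the ids increase monotonically, but the first id of partition 1 jumps to 2**33 instead of continuing from 2.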

# camelCase name as spelled in Spark 1.5
from pyspark.sql.functions import monotonicallyIncreasingId

# Returns a new DataFrame with all the existing columns plus an "id" column
res = df.withColumn("id", monotonicallyIncreasingId())

Edit: 19/1/2017

As commented by @Sean:

Use monotonically_increasing_id() instead for Spark 1.6 and later.

How to add a row id to a PySpark DataFrame

1 answer: