Update a dataframe column based on other column values

Time: 2017-06-09 09:38:15

Tags: scala apache-spark apache-spark-sql

I am trying to update the value of a column using the values of other columns, in Scala.

This is the data in my dataframe:

+-------------------+------+------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier|   _c0|   _c1|  _c2|   _c3| _c4|                 _c5|isBadRecord|
+-------------------+------+------+-----+------+----+--------------------+-----------+
|                  1|     0|     0| Name|     0|Desc|                    |          0|
|                  2|  2.11| 10000|Juice|     0| XYZ|2016/12/31 : Inco...|          0|
|                  3|-0.500|-24.12|Fruit|  -255| ABC| 1994-11-21 00:00:00|          0|
|                  4| 0.087|  1222|Bread|-22.06|    | 2017-02-14 00:00:00|          0|
|                  5| 0.087|  1222|Bread|-22.06|    |                    |          0|
+-------------------+------+------+-----+------+----+--------------------+-----------+

Here the _c5 column contains an incorrect value (the value in row 2 contains the string "Incorrect"), and I want its isBadRecord field to be updated to 1.

Is there a way to update this field?

3 Answers:

Answer 0 (score: 2)

You can use the withColumn API with one of the functions that meets your need to fill in the bad records.

For your case you can write a udf function:

import org.apache.spark.sql.functions.udf

def fillbad = udf((c5: String) => if (c5.contains("Incorrect")) 1 else 0)
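
and call it on the dataframe. A minimal sketch of the call (assuming df is bound to the question's dataframe; withColumn overwrites the existing isBadRecord column):

df.withColumn("isBadRecord", fillbad(df("_c5"))).show()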

Answer 1 (score: 2)

I suggest you think of it like in SQL, rather than reasoning about updating it in place. You could do the following:

import org.apache.spark.sql.functions.when
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = ??? // your spark session
val df: DataFrame = ??? // your dataframe

import spark.implicits._

df.select(
  $"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
  $"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")

Here is a self-contained script that you can copy and paste into the Spark shell to see the result locally:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.when
import org.apache.spark.sql.types._

sc.setLogLevel("ERROR")

val schema = 
  StructType(Seq(
    StructField("UniqueRowIdentifier", IntegerType),
    StructField("_c0", DoubleType),
    StructField("_c1", DoubleType),
    StructField("_c2", StringType),
    StructField("_c3", DoubleType),
    StructField("_c4", StringType),
    StructField("_c5", StringType),
    StructField("isBadRecord", IntegerType)))

val contents =
  Seq(
    Row(1,  0.0  ,     0.0 ,  "Name",    0.0, "Desc",                       "", 0),
    Row(2,  2.11 , 10000.0 , "Juice",    0.0,  "XYZ", "2016/12/31 : Incorrect", 0),
    Row(3, -0.5  ,   -24.12, "Fruit", -255.0,  "ABC",    "1994-11-21 00:00:00", 0),
    Row(4,  0.087,  1222.0 , "Bread",  -22.06,    "",    "2017-02-14 00:00:00", 0),
    Row(5,  0.087,  1222.0 , "Bread",  -22.06,    "",                       "", 0)
  )

val df = spark.createDataFrame(sc.parallelize(contents), schema)

df.show()

val withBadRecords =
  df.select(
    $"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
    $"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")

withBadRecords.show()

And here is the relevant output:

+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier|  _c0|    _c1|  _c2|   _c3| _c4|                 _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|                  1|  0.0|    0.0| Name|   0.0|Desc|                    |          0|
|                  2| 2.11|10000.0|Juice|   0.0| XYZ|2016/12/31 : Inco...|          0|
|                  3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00|          0|
|                  4|0.087| 1222.0|Bread|-22.06|    | 2017-02-14 00:00:00|          0|
|                  5|0.087| 1222.0|Bread|-22.06|    |                    |          0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+

+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier|  _c0|    _c1|  _c2|   _c3| _c4|                 _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|                  1|  0.0|    0.0| Name|   0.0|Desc|                    |          0|
|                  2| 2.11|10000.0|Juice|   0.0| XYZ|2016/12/31 : Inco...|          1|
|                  3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00|          0|
|                  4|0.087| 1222.0|Bread|-22.06|    | 2017-02-14 00:00:00|          0|
|                  5|0.087| 1222.0|Bread|-22.06|    |                    |          0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+

Answer 2 (score: 1)

The best option is to create a UDF that tries to convert the value to a date format: if the conversion succeeds it returns 0, otherwise it returns 1.

This works even when you have a bad date format:
import java.text.SimpleDateFormat

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

import scala.util.{Failure, Success, Try}

val spark = SparkSession.builder().master("local")
  .appName("test").getOrCreate()

import spark.implicits._

// create a test dataframe
val data = spark.sparkContext.parallelize(Seq(
  (1, "1994-11-21 Xyz"),
  (2, "1994-11-21 00:00:00"),
  (3, "1994-11-21 00:00:00")
)).toDF("id", "date")

// create a udf which tries to parse the value as a date:
// returns 0 on success and 1 on failure
val check = udf((value: String) => {
  Try(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(value)) match {
    case Success(d) => 0
    case Failure(e) => 1
  }
})

// add the flag column
data.withColumn("badData", check($"date")).show
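
To apply the same check to the question's dataframe, note that rows with an empty _c5 should not be flagged, so the empty case needs special handling. A hedged sketch (assuming the question's dataframe is bound to df and org.apache.spark.sql.functions.when is imported):

// empty _c5 is treated as a good record; anything else must parse as a date
df.withColumn("isBadRecord", when($"_c5" === "", 0).otherwise(check($"_c5"))).show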

Hope this helps!