Replacing special characters in the column names of a Spark DataFrame

Asked: 2018-06-29 08:50:48

Tags: scala replace apache-spark-sql

I have an input Spark DataFrame named df:

+---------------+----------------+-----------------------+
|Main_CustomerID|126+ Concentrate|2.5 Ethylhexyl_Acrylate|
+---------------+----------------+-----------------------+
|         725153|             3.0|                    2.0|
|         873008|             4.0|                    1.0|
|         625109|             1.0|                    0.0|
+---------------+----------------+-----------------------+

I need to remove the special characters from the column names of df, as follows:

  • Remove the +

  • Replace spaces with underscores

  • Replace dots with underscores

So my df should look like:

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|         725153|            3.0|                    2.0|
|         873008|            4.0|                    1.0|
|         625109|            1.0|                    0.0|
+---------------+---------------+-----------------------+

Using Scala, I have done this as follows:

var tableWithColumnsRenamed = df

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}

df = tableWithColumnsRenamed

But when I use,

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}

the name of my first column comes out as 126 Concentrate instead of 126_Concentrate. (Each chained withColumnRenamed call still passes the original field as the existing name, so once the first matching rename succeeds, the remaining calls refer to a column that no longer exists and are silently no-ops.)
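Chaining the replacements on the string itself, rather than stacking withColumnRenamed calls, avoids that problem in a single loop (a minimal sketch of the corrected loop):

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed.withColumnRenamed(
    field,
    field.replaceAll("\\.", "_").replaceAll("\\+", "").replaceAll(" ", "_"))
}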

But I don't want three separate for loops for this replacement. Is there a better solution?

5 Answers:

Answer 0 (score: 4)

You can use withColumnRenamed with regex replaceAllIn and foldLeft as shown below:

val columns = df.columns

// collapse every run of +, ., _, commas and spaces into a single underscore
val regex = """[+._, ]+"""
val replacingColumns = columns.map(regex.r.replaceAllIn(_, "_"))

// zip (newName, oldName) pairs and rename them one by one via foldLeft
val resultDF = replacingColumns.zip(columns).foldLeft(df) {
  (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)
}

resultDF.show(false)

which should give you

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|725153         |3.0            |2.0                    |
|873008         |4.0            |1.0                    |
|625109         |1.0            |0.0                    |
+---------------+---------------+-----------------------+
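For reference, the zip step can also be folded in directly, applying the regex inside the fold (a variant sketch of the same approach):

val resultDF = df.columns.foldLeft(df) { (tempdf, name) =>
  tempdf.withColumnRenamed(name, """[+._, ]+""".r.replaceAllIn(name, "_"))
}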

I hope the answer is helpful.

Answer 1 (score: 3)

df
  .columns
  .foldLeft(df){(newdf, colname) =>
    newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_"))
  }
  .show
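This variant only replaces spaces and dots; to also drop the + as the question asks, the same chain could be extended, for example:

df
  .columns
  .foldLeft(df) { (newdf, colname) =>
    newdf.withColumnRenamed(
      colname,
      colname.replace("+", "").replace(" ", "_").replace(".", "_"))
  }
  .show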

Answer 2 (score: 0)

In Java, you can iterate over the column names using df.columns() and fix each header string with String's replaceAll(regexPattern, intendedCharReplacement),

and then rename the df headers using withColumnRenamed(headerName, correctedHeaderName).

For example:

for (String headerName : dataset.columns()) {
    // replaceAll takes a regex, so the "+" must be escaped; it is removed
    // rather than replaced with "_", per the question's requirements
    String correctedHeaderName = headerName.replaceAll(" ", "_").replaceAll("\\+", "");
    dataset = dataset.withColumnRenamed(headerName, correctedHeaderName);
}
dataset.show();

Answer 3 (score: 0)

Piggybacking on Ramesh's answer, here is a reusable function that uses currying syntax and the .transform() method, and also makes the columns lowercase.

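A minimal sketch of what that could look like (toSnakeCase and snakeCaseColumns are illustrative names, assuming the usual currying pattern for .transform()):

import org.apache.spark.sql.DataFrame

// illustrative helper: lowercase a column name, drop "+",
// and turn runs of dots/spaces into single underscores
def toSnakeCase(name: String): String =
  name.toLowerCase.replaceAll("\\+", "").replaceAll("[. ]+", "_")

// curried so it can be passed to df.transform(...)
def snakeCaseColumns()(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (memoDF, colName) =>
    memoDF.withColumnRenamed(colName, toSnakeCase(colName))
  }

df.transform(snakeCaseColumns()).show(false)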

Answer 4 (score: 0)

We can rename all the columns in one go by mapping each column_name to a new name, after replacing the special characters using replaceAll for each respective character. This line of code is tried and tested with Spark Scala:

import org.apache.spark.sql.functions.col

df.select(
  df.columns.map { colName =>
    // the backticks keep Spark from parsing the dot in the name as struct access
    col(s"`${colName}`").as(colName.replaceAll("\\.", "_").replaceAll(" ", "_"))
  }: _*
).show(false)
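Note that this still leaves the + from 126+ Concentrate in place; chaining one more replaceAll would cover it, for example:

df.select(
  df.columns.map { colName =>
    col(s"`${colName}`").as(colName.replaceAll("\\+", "").replaceAll("[. ]+", "_"))
  }: _*
).show(false)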