pyspark: remove all special characters from all column names

Asked: 2020-06-18 02:37:17

Tags: pyspark

I am trying to remove all special characters from all of the column names. I am using the following commands:

from pyspark.sql import functions as F

df_spark = spark_df.select([F.col(col).alias(col.replace(' ', '_')) for col in spark_df.columns])
df_spark1 = df_spark.select([F.col(col).alias(col.replace('%', '_')) for col in df_spark.columns])
df_spark = df_spark1.select([F.col(col).alias(col.replace(',', '_')) for col in df_spark1.columns])
df_spark1 = df_spark.select([F.col(col).alias(col.replace('(', '_')) for col in df_spark.columns])
df_spark2 = df_spark1.select([F.col(col).alias(col.replace(')', '_')) for col in df_spark1.columns])

Is there an easier way to replace all special characters (not just the five above) in a single command? I am using pyspark on Databricks.

Thanks!

4 Answers:

Answer 0 (score: 0)

Use Python's re (regular expression) module together with a list comprehension.

Example:

df=spark.createDataFrame([('a b','ac','ac','ac','ab')],["i d","id,","i(d","i)k","i%j"])

df.columns
#['i d', 'id,', 'i(d', 'i)k', 'i%j']

import re

#replace all the special characters using a list comprehension
[re.sub(r'[()\s,%]', '', x) for x in df.columns]
#['id', 'id', 'id', 'ik', 'ij']

df.toDF(*[re.sub(r'[()\s,%]', '', x) for x in df.columns])
#DataFrame[id: string, id: string, id: string, ik: string, ij: string]
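
Note that stripping characters this way can leave duplicate column names (the output above has id three times), which makes later references to those columns ambiguous. A minimal sketch of one way to make the cleaned names unique; the suffixing scheme here is just an illustration, not part of the original answer:

import re
from collections import Counter

def clean_columns(cols):
    # strip everything except letters, digits and underscore
    cleaned = [re.sub(r'\W', '', c) for c in cols]
    # suffix repeated names with _1, _2, ... so each column stays addressable
    seen = Counter()
    unique = []
    for c in cleaned:
        seen[c] += 1
        unique.append(c if seen[c] == 1 else f"{c}_{seen[c] - 1}")
    return unique

df.toDF(*clean_columns(df.columns))
# DataFrame[id: string, id_1: string, id_2: string, ik: string, ij: string]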

Answer 1 (score: 0)

You can replace any character other than letters (A-Z, a-z) and digits (0-9):

import re
from pyspark.sql import functions as F

df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z]+", "", col)) for col in df.columns])
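
A quick sketch of what that one-liner produces on a toy DataFrame (the column names below are made up for illustration):

import re
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# column names containing spaces, % and commas
df = spark.createDataFrame([(1, 2, 3)], ["first name", "sales%", "qty,total"])
df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z]+", "", col)) for col in df.columns])

print(df.columns)
# ['firstname', 'sales', 'qtytotal']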

Answer 2 (score: 0)

re.sub(r'[^\w]', '_', c) replaces punctuation and whitespace in a column name c with an underscore (_).

Test — strip punctuation and replace spaces with _:

from pyspark.sql import SparkSession
import re

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2, 3, 4)], [' 1', '%2', ',3', '(4)'])
df = df.toDF(*[re.sub(r'[^\w]', '_', c) for c in df.columns])
df.show()
# +---+---+---+---+
# | _1| _2| _3|_4_|
# +---+---+---+---+
# |  1|  2|  3|  4|
# +---+---+---+---+
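
If runs of several special characters should collapse into a single underscore rather than one underscore each, a + quantifier on the class does it. A small variation on the same idea, not part of the original answer:

import re

cols = [' 1', '%,2', '((3']
# \W+ matches one or more consecutive non-word characters
print([re.sub(r'\W+', '_', c) for c in cols])
# ['_1', '_2', '_3']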

Answer 3 (score: -1)

Maybe this is useful (Scala):

    import org.apache.spark.sql.functions.{lit, regexp_replace}
    import spark.implicits._

    // [^0-9a-zA-Z]+ => replaces every run of special chars with "_"
    spark.range(2).withColumn("str", lit("abc%xyz_12$q"))
      .withColumn("replace", regexp_replace($"str", "[^0-9a-zA-Z]+", "_"))
      .show(false)

    /**
      * +---+------------+------------+
      * |id |str         |replace     |
      * +---+------------+------------+
      * |0  |abc%xyz_12$q|abc_xyz_12_q|
      * |1  |abc%xyz_12$q|abc_xyz_12_q|
      * +---+------------+------------+
      */

    // to keep a specific special char such as $, add it to the class: [^0-9a-zA-Z$]+
    spark.range(2).withColumn("str", lit("abc%xyz_12$q"))
      .withColumn("replace", regexp_replace($"str", "[^0-9a-zA-Z$]+", "_"))
      .show(false)

    /**
      * +---+------------+------------+
      * |id |str         |replace     |
      * +---+------------+------------+
      * |0  |abc%xyz_12$q|abc_xyz_12$q|
      * |1  |abc%xyz_12$q|abc_xyz_12$q|
      * +---+------------+------------+
      */
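
Since the question is about PySpark, a rough Python equivalent of the same regexp_replace idea might look like the sketch below. Note that this replaces special characters in column values; for column names you would still rename as in the other answers:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(2).withColumn("str", F.lit("abc%xyz_12$q"))
df = df.withColumn("replace", F.regexp_replace("str", "[^0-9a-zA-Z]+", "_"))
df.show(truncate=False)
# +---+------------+------------+
# |id |str         |replace     |
# +---+------------+------------+
# |0  |abc%xyz_12$q|abc_xyz_12_q|
# |1  |abc%xyz_12$q|abc_xyz_12_q|
# +---+------------+------------+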